
Latest publications from the 2019 15th International Conference on eScience (eScience)

EDISON Data Science Framework (EDSF) Extension to Address Transversal Skills Required by Emerging Industry 4.0 Transformation
Pub Date: 2019-09-01 DOI: 10.1109/eScience.2019.00076
Y. Demchenko, T. Wiktorski, J. Cuadrado-Gallego, Steve Brewer
The emerging data-driven economy (also referred to as Industry 4.0 or simply 4IR), encompassing industry, research and business, requires new types of specialists able to support all stages of the data lifecycle, from data production and input to data processing, actionable results delivery, visualisation and reporting; these can be collectively defined as the Data Science family of professions. Data Science as a research and academic discipline provides a basis for Data Analytics and ML/AI applications. The education and training of the data-related professions must reflect all the multi-disciplinary knowledge and competences required of Data Science and data handling practitioners in modern, data-driven research and the digital economy. In an era of ever faster technology change, matched by strong demand for skills, Data Science education and training programmes should be customizable and deliverable in multiple forms, tailored to different categories of professional roles and profiles. Building on the authors' other publications on constructing customizable and interoperable Data Science curricula for different types of learners and target application domains, this paper focuses on defining the set of transversal competences and skills required of current and future Data Science professions. These include workplace and professional skills covering the critical thinking, problem solving, and creativity needed to work in highly automated and dynamic environments. The proposed approach is based on the EDISON Data Science Framework (EDSF), initially developed within the EU-funded EDISON project and currently being further developed in the EU-funded MATES and FAIRsFAIR projects.
Citations: 6
Support for HTCondor High-Throughput Computing Workflows in the REANA Reusable Analysis Platform
Pub Date: 2019-09-01 DOI: 10.1109/eScience.2019.00091
Rokas Maciulaitis, T. Simko, P. Brenner, Scott S. Hampton, M. Hildreth, K. H. Anampa, Irena Johnson, Cody Kankel, Jan Okraska, D. Rodríguez
REANA is a reusable and reproducible data analysis platform that allows researchers to structure their analysis pipelines and run them on remote containerised compute clouds. REANA supports several different workflow systems (CWL, Serial, Yadage) and uses Kubernetes as its job execution backend. We have designed an abstract job execution component that extends the REANA platform's job execution capabilities to support multiple compute backends. We have tested the abstract job execution component with HTCondor and verified the scalability of the designed solution. The results show that the REANA platform would be able to support hybrid scientific workflows in which different parts of an analysis pipeline are executed on multiple computing backends.
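The abstract job execution idea can be illustrated with a minimal sketch. This is not REANA's actual API; the class and method names below are invented for illustration, with a toy in-process backend standing in for Kubernetes or HTCondor:

```python
from abc import ABC, abstractmethod

class JobBackend(ABC):
    """Abstract compute backend; names are illustrative, not REANA's real interface."""
    @abstractmethod
    def submit(self, command: str) -> str: ...
    @abstractmethod
    def status(self, job_id: str) -> str: ...

class LocalBackend(JobBackend):
    """Toy in-process backend standing in for a Kubernetes or HTCondor backend."""
    def __init__(self):
        self._jobs = {}

    def submit(self, command):
        job_id = f"job-{len(self._jobs)}"
        self._jobs[job_id] = "finished"  # pretend the job completes instantly
        return job_id

    def status(self, job_id):
        return self._jobs[job_id]

def run_step(backend: JobBackend, command: str) -> str:
    """The workflow engine sees only the abstract interface, so backends are swappable."""
    job_id = backend.submit(command)
    return backend.status(job_id)
```

Because `run_step` depends only on the abstract interface, adding another backend means implementing `submit`/`status` once, without touching the workflow engine.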
Citations: 8
SATVAM: Toward an IoT Cyber-Infrastructure for Low-Cost Urban Air Quality Monitoring
Pub Date: 2019-09-01 DOI: 10.1109/eScience.2019.00014
Yogesh L. Simmhan, M. Hegde, Rajesh Zele, S. Tripathi, S. Nair, S. Monga, R. Sahu, Kuldeep Dixit, R. Sutaria, Brijesh Mishra, Anamika Sharma, A. Svr
Air pollution is a public health emergency in large cities. The availability of commodity sensors and the advent of the Internet of Things (IoT) enable the deployment of city-wide networks of thousands of low-cost real-time air quality monitors to help manage this challenge. This needs to be supported by an IoT cyber-infrastructure for reliable and scalable data acquisition from the edge to the Cloud. The low accuracy of such sensors also motivates the need for data-driven calibration models that can accurately predict the science variables from the raw sensor signals. Here, we offer our experiences designing and deploying such an IoT software platform and calibration models, and validate them through a pilot field deployment in two mega-cities, Delhi and Mumbai. Our edge data service is able to even out the differential bandwidths from the sensing devices to the Cloud repository, and to recover from transient failures. Our analytical models reduce sensor error from a best case of 63% using the factory baseline to as low as 21%, substantially advancing the state of the art in this domain.
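The kind of data-driven calibration described here can be sketched as an ordinary least-squares fit mapping raw sensor signals (plus a covariate such as relative humidity) to reference-monitor values. The data below are synthetic and the linear model form is an assumption; the paper's actual calibration models are more sophisticated:

```python
import numpy as np

# Synthetic example: a raw low-cost sensor signal plus a humidity covariate.
rng = np.random.default_rng(0)
raw = rng.uniform(50, 200, size=200)           # raw sensor output (arbitrary units)
rh = rng.uniform(30, 90, size=200)             # relative humidity (%)
noisy_ref = 0.4 * raw - 0.1 * rh + 5.0 + rng.normal(0, 1.0, 200)  # "reference monitor"

# Least-squares calibration: reference ~ a*raw + b*rh + c
X = np.column_stack([raw, rh, np.ones_like(raw)])
coef, *_ = np.linalg.lstsq(X, noisy_ref, rcond=None)

predicted = X @ coef
rmse = float(np.sqrt(np.mean((predicted - noisy_ref) ** 2)))
```

With the humidity covariate included, the fit recovers the underlying slope and its residual error stays near the reference noise level; dropping the covariate would fold the humidity effect into the error, which is one reason low-cost sensor calibration benefits from extra inputs.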
Citations: 6
Modeling and Matching Digital Data Marketplace Policies
Pub Date: 2019-09-01 DOI: 10.1109/eScience.2019.00078
Sara Shakeri, Valentina Maccatrozzo, L. Veen, R. Bakhshi, L. Gommans, C. D. Laat, P. Grosso
Recently, Digital Data Marketplaces (DDMs) have been gaining wide attention as sharing platforms among different organizations, because sharing information and participating in research collaborations play an important role in addressing many scientific challenges. To increase trust among participating organizations, multiple contracts and agreements must be established in order to determine regulations and policies about who has access to what. Describing these agreements in a general model applicable across different DDMs is of utmost importance. In this paper, we present a semantic model for describing access policies by means of semantic web technologies. In particular, we use and extend the Open Digital Rights Language (ODRL) to describe the pre-established agreements in a DDM.
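As a rough illustration of the idea, an ODRL-style agreement can be represented as a permission/prohibition structure and matched against access requests. The dictionary layout and naive matcher below are simplified stand-ins: real ODRL policies are JSON-LD using the W3C ODRL 2.2 vocabulary, and the paper extends ODRL rather than reimplementing it:

```python
# A minimal ODRL-flavoured policy as a plain dict (structure is illustrative only).
policy = {
    "@type": "Agreement",
    "permission": [{
        "target": "http://example.com/dataset/1",
        "assignee": "http://example.com/org/alice",
        "action": "read",
    }],
    "prohibition": [{
        "target": "http://example.com/dataset/1",
        "assignee": "http://example.com/org/bob",
        "action": "distribute",
    }],
}

def is_permitted(policy, assignee, action, target):
    """Naive matcher: allowed iff an explicit permission exists and no prohibition applies."""
    def matches(rule):
        return (rule["assignee"] == assignee
                and rule["action"] == action
                and rule["target"] == target)
    if any(matches(r) for r in policy.get("prohibition", [])):
        return False
    return any(matches(r) for r in policy.get("permission", []))
```

A closed-world default (anything not explicitly permitted is denied) keeps the matcher conservative, which is usually the safer choice when agreements govern data access.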
Citations: 7
Towards a Computer-Interpretable Actionable Formal Model to Encode Data Governance Rules
Pub Date: 2019-09-01 DOI: 10.1109/eScience.2019.00082
Rui Zhao, M. Atkinson
Driven by the needs of science and business, data sharing and re-use have become intensive activities across many areas. In many cases, governance imposes rules concerning data use, but no existing computational technique helps data users comply with such rules. We argue that intelligent systems can improve the situation by recording provenance during processing, encoding the rules, and performing reasoning. We present our initial work, designing formal models for data rules and flow rules together with the reasoning system, as a first step towards helping data providers and data users sustain productive relationships.
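One way to picture the combination of provenance recording and rule reasoning is tag propagation: derived data inherits governance tags from its inputs, and a flow rule checks those tags before a use is allowed. The tags and rule below are invented for illustration and are not the paper's formal model:

```python
# Toy formalisation: each dataset carries a set of governance tags recorded as
# provenance, and a flow rule forbids exporting anything carrying a given tag.
def derive(*inputs):
    """Derived data inherits the union of its inputs' tags (simple provenance propagation)."""
    return set().union(*inputs)

def may_export(tags, forbidden=frozenset({"no-export"})):
    """Flow rule: export is allowed only if no forbidden tag is present."""
    return not (tags & forbidden)

clinical = {"no-export", "medical"}      # restricted input
public_stats = {"open"}                  # unrestricted input
merged = derive(clinical, public_stats)  # restriction propagates to the result
```

The key property is that the restriction survives derivation: once restricted data flows into a computation, every downstream result inherits the obligation automatically.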
Citations: 4
Dynamic Sizing of Continuously Divisible Jobs for Heterogeneous Resources
Pub Date: 2019-09-01 DOI: 10.1109/eScience.2019.00026
Nicholas L. Hazekamp, Benjamín Tovar, D. Thain
Many scientific applications operate on large datasets that can be partitioned and operated on concurrently. Existing approaches to concurrent execution generally rely on statically partitioned data. This static partitioning can lock performance into a sub-optimal configuration, leading to higher execution time and an inability to respond to dynamic resources. We present the Continuously Divisible Job abstraction, which allows statically defined applications to have their component tasks dynamically sized in response to system behavior. The abstraction defines a simple interface that dictates how work can be recursively divided, executed, and merged. Implementing this interface allows scientific applications to leverage dynamic job coordinators for execution. We also propose the Virtual File abstraction, which allows read-only subsets of large files to be treated as separate files. In exploring the Continuously Divisible Job abstraction, we implemented two applications using its interface: a bioinformatics application and a high-energy physics event analysis. These were tested using an abstract job interface and several job coordinators. Comparing these against a previous statically partitioned implementation, we show comparable or better performance without having to make static decisions or implement complex dynamic application handling.
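A minimal sketch of a divide/execute/merge interface of this kind might look as follows; the method names and the recursive coordinator are illustrative assumptions, not the paper's actual interface:

```python
from abc import ABC, abstractmethod

class DivisibleJob(ABC):
    """Sketch of a divide/execute/merge interface; names are illustrative."""
    @abstractmethod
    def size(self) -> int: ...
    @abstractmethod
    def divide(self): ...          # -> (DivisibleJob, DivisibleJob)
    @abstractmethod
    def execute(self): ...         # run this piece directly
    @staticmethod
    @abstractmethod
    def merge(a, b): ...           # combine two partial results

class SumJob(DivisibleJob):
    """Toy job: sum a list, divisible at any point."""
    def __init__(self, data):
        self.data = data

    def size(self):
        return len(self.data)

    def divide(self):
        mid = len(self.data) // 2
        return SumJob(self.data[:mid]), SumJob(self.data[mid:])

    def execute(self):
        return sum(self.data)

    @staticmethod
    def merge(a, b):
        return a + b

def run(job, max_chunk):
    """A coordinator recursively divides work until pieces fit the resource, then merges."""
    if job.size() <= max_chunk:
        return job.execute()
    left, right = job.divide()
    return job.merge(run(left, max_chunk), run(right, max_chunk))
```

Because the split point is chosen at run time (`max_chunk`), the same application can be sized differently for fast and slow resources without changing the application code, which is the point of the abstraction.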
Citations: 1
A Survey of Scalable Deep Learning Frameworks
Pub Date: 2019-09-01 DOI: 10.1109/eScience.2019.00102
Saba Amiri, Sara Salimzadeh, Adam Belloum
Machine learning models have recently seen a large increase in usage across different disciplines. Their ability to learn complex concepts from data and perform sophisticated tasks, combined with their ability to leverage the vast computational infrastructures available today, has made them a very attractive choice for many challenges in academia and industry. In this context, deep learning, as a sub-class of machine learning, is becoming an important tool in modern computing applications. It has been used successfully for a wide range of use cases, from medical applications to playing games. Due to the nature of these systems, and the fact that a considerable portion of their use cases deal with large volumes of data, training them is a very time- and resource-consuming task that requires vast amounts of computing cycles. To overcome this issue, it is natural to scale deep learning applications to run across multiple compute nodes, achieving fast and manageable training speeds while maintaining a high level of accuracy. In recent years, a number of frameworks, with roots in both academia and industry, have been proposed to scale up ML algorithms and overcome the scalability issue. With most of them being open source and supported by an increasingly large community of AI specialists and data scientists, their capabilities, performance, and compatibility with modern hardware have been honed and extended. Thus, it is not easy for the domain scientist to pick the tool or framework best suited to their needs. This research aims to provide an overview of the relevant, widely used scalable machine learning and deep learning frameworks currently available, and to provide the grounds on which researchers can compare and choose the best set of tools for their ML pipeline.
Citations: 11
Timing is Everything: Identifying Diverse Interaction Dynamics in Scenario and Non-Scenario Meetings
Pub Date: 2019-09-01 DOI: 10.1109/eScience.2019.00029
Chreston A. Miller, Christa Miller
In this paper we explore the use of temporal patterns to define interaction dynamics in different kinds of meetings. Meetings occur on a daily basis and involve different behavioral dynamics between participants, such as floor shifts and intense dialog. These dynamics can tell the story of a meeting and provide insight into how participants interact. We focus our investigation on defining diversity metrics to compare the interaction dynamics of scenario and non-scenario meetings. These metrics may provide insight into the similarities and differences between the two kinds of meetings. We observe that certain interaction dynamics can be identified through temporal patterns of speech intervals, i.e., when a participant is talking. We apply the principles of Parallel Episodes to identify moments of speech overlap, e.g., interaction "bursts", and introduce Situated Data Mining, an approach for identifying repeated behavior patterns based on situated context. Applying these algorithms provides an overview of certain meeting dynamics and defines metrics for meeting comparison and diversity of interaction. We tested our approach on a subset of the AMI corpus and developed three diversity metrics to describe similarities and differences between meetings. These metrics also give the researcher an overview of interaction dynamics and present points of interest for analysis.
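Detecting moments of speech overlap between two speakers reduces to finding intersections between their talk intervals. The two-pointer sweep below is a generic interval-intersection sketch, not the paper's Parallel Episodes implementation:

```python
def overlap_bursts(intervals_a, intervals_b):
    """Return spans where two speakers' talk intervals overlap (candidate "bursts").

    Intervals are (start, end) tuples in seconds, assumed sorted and
    non-overlapping within each speaker's own list.
    """
    i = j = 0
    overlaps = []
    while i < len(intervals_a) and j < len(intervals_b):
        start = max(intervals_a[i][0], intervals_b[j][0])
        end = min(intervals_a[i][1], intervals_b[j][1])
        if start < end:                      # strictly positive overlap
            overlaps.append((start, end))
        # Advance whichever interval ends first.
        if intervals_a[i][1] < intervals_b[j][1]:
            i += 1
        else:
            j += 1
    return overlaps
```

The sweep runs in linear time in the total number of intervals, so it scales to full-meeting transcripts; thresholding the returned span lengths would separate brief backchannels from sustained overlap.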
Citations: 3
Understanding ML Driven HPC: Applications and Infrastructure
Pub Date: 2019-09-01 DOI: 10.1109/eScience.2019.00054
Geoffrey Fox, S. Jha
We recently outlined the vision of "Learning Everywhere", which captures the possibility and impact of coupling learning methods with traditional HPC methods. A primary driver of such coupling is the promise that Machine Learning (ML) will give major performance improvements to traditional HPC simulations. Motivated by this potential, the "ML around HPC" class of integration is of particular significance. In a related follow-up paper, we provided an initial taxonomy for integrating learning around HPC methods. In this paper, part of the Learning Everywhere series, we discuss how learning methods and HPC simulations are being integrated to enhance the effective performance of computations. We describe several modes, namely substitution, assimilation, and control, in which learning methods integrate with HPC simulations, and provide representative applications for each mode. We also discuss some open research questions, which we hope will motivate and clear the ground for MLaroundHPC benchmarks.
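The substitution mode, in which a learned model replaces part of an expensive simulation, can be pictured with a toy surrogate: fit a cheap model on a handful of simulation runs, then answer subsequent queries from the surrogate instead of re-running the simulation. The simulator and polynomial surrogate below are illustrative stand-ins, not a method from the paper:

```python
import numpy as np

def simulate(x):
    """Stand-in for an expensive HPC simulation (here just an analytic function)."""
    return np.sin(x) + 0.1 * x**2

# Substitution-mode sketch: a few "expensive" runs train a cheap learned surrogate.
train_x = np.linspace(0, 3, 30)
train_y = simulate(train_x)
surrogate = np.polynomial.Polynomial.fit(train_x, train_y, deg=6)

# Further queries go to the surrogate, not the simulator.
query = 1.5
error = abs(float(surrogate(query)) - float(simulate(query)))
```

The trade-off is characteristic of the mode: the surrogate is accurate only inside the region covered by the training runs, so the coupling needs a strategy for deciding when to fall back to the real simulation.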
Citations: 9
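Of the modes named in the abstract, substitution is the easiest to illustrate: a cheap learned surrogate replaces an expensive simulation kernel after being trained on a handful of its runs. The 1-D "simulation" and the piecewise-linear surrogate below are illustrative assumptions, not the paper's method:

```python
# Minimal, hypothetical sketch of the "substitution" mode: a surrogate fitted
# to a few expensive simulation runs stands in for the simulation itself.
# The 1-D kernel and piecewise-linear fit are stand-ins for illustration.

import bisect
import math

def expensive_simulation(x: float) -> float:
    """Stand-in for a costly HPC kernel (here just a smooth function)."""
    return math.sin(x) + 0.1 * x

def train_surrogate(sample_points):
    """'Train' by evaluating the simulation at a coarse set of points."""
    xs = sorted(sample_points)
    ys = [expensive_simulation(x) for x in xs]

    def surrogate(x: float) -> float:
        # Piecewise-linear interpolation between sampled simulation runs.
        i = bisect.bisect_left(xs, x)
        if i == 0:
            return ys[0]
        if i == len(xs):
            return ys[-1]
        x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
        return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

    return surrogate

# 13 simulation runs over [0, 6]; afterwards queries cost almost nothing.
surrogate = train_surrogate([i * 0.5 for i in range(13)])
print(abs(surrogate(2.3) - expensive_simulation(2.3)) < 0.05)  # → True
```

A real ML around HPC deployment would swap in a trained model (e.g. a neural network) and a high-dimensional simulation, but the workflow shape — sample, fit, substitute — is the same.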
Toward an Elastic Data Transfer Infrastructure
Pub Date : 2019-09-01 DOI: 10.1109/eScience.2019.00036
Joaquín Chung, Zhengchun Liu, R. Kettimuthu, Ian T Foster
Data transfer over wide area networks is an integral part of many science workflows that must, for example, move data from scientific facilities to remote resources for analysis, sharing, and storage. Yet despite continued enhancements in data transfer infrastructure (DTI), our previous analyses of approximately 40 billion GridFTP command logs collected over four years from the Globus transfer service show that data transfer nodes (DTNs) are idle (i.e., are performing no transfers) 94.3% of the time. On the other hand, we have also observed periods in which CPU resource scarcity negatively impacts DTN throughput. Motivated by the opportunity to optimize DTI performance, we present here an elastic DTI architecture in which the pool of nodes allocated to DTN activities expands and shrinks over time, based on demand. Our results show that this elastic DTI can save up to ~95% of resources compared with a typical static DTN deployment, with the median slowdown incurred remaining close to one for most of the evaluated scenarios.
Citations: 1
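The expand-and-shrink behavior the abstract describes can be sketched as a simple demand-driven scaling policy. The thresholds, per-node capacity, and the choice to shrink one node at a time are illustrative assumptions, not the paper's actual controller:

```python
# Hedged sketch of an elastic DTN pool: grow immediately to meet transfer
# demand, shrink gradually when nodes go idle. Capacity and bounds are
# assumed values for illustration only.

def scale_pool(active_transfers: int, pool_size: int,
               per_node_capacity: int = 4,
               min_nodes: int = 1, max_nodes: int = 16) -> int:
    """Return the new pool size for the observed number of active transfers."""
    # Nodes needed to serve current demand at the assumed per-node capacity.
    needed = max(min_nodes, -(-active_transfers // per_node_capacity))  # ceil div
    # Expand immediately to meet demand; shrink one node at a time so that
    # brief lulls do not thrash the pool.
    if needed > pool_size:
        return min(needed, max_nodes)
    if needed < pool_size:
        return pool_size - 1
    return pool_size

# Demand trace: a burst of transfers followed by an idle period.
pool = 1
sizes = []
for demand in [2, 10, 25, 25, 6, 0, 0, 0]:
    pool = scale_pool(demand, pool)
    sizes.append(pool)
print(sizes)  # → [1, 3, 7, 7, 6, 5, 4, 3]
```

The asymmetry (fast scale-up, slow scale-down) reflects the trade-off the abstract evaluates: reclaiming idle resources while keeping the slowdown incurred by transfers close to one.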