Scaling Notebooks as Re-configurable Cloud Workflows

Data Intelligence, vol. 4, pp. 409-425. Pub Date: 2022-04-01. DOI: 10.1162/dint_a_00140
Impact factor: 1.3. JCR: Q3 (Computer Science, Information Systems). Zone 3 (Computer Science). Citations: 4.
Yuandou Wang, Spiros Koulouzis, Riccardo Bianchi, N. Li, Yifang Shi, J. Timmermans, W. Kissling, Zhiming Zhao

Abstract

Literate computing environments such as Jupyter (i.e., Jupyter Notebooks, JupyterLab, and JupyterHub) are widely used in scientific studies; they allow users to interactively develop scientific code, test algorithms, and describe the scientific narrative of their experiments in one integrated document. To scale up scientific analyses, many Jupyter environment architectures encapsulate whole notebooks as reproducible units and autoscale them on dedicated remote infrastructures (e.g., high-performance computing and cloud computing environments). These existing solutions remain limited in several ways: 1) the workflow (or pipeline) is implicit in a notebook, and although some steps could be reused by different code and executed in parallel, the tight cell structure forces all steps in a Jupyter notebook to be executed sequentially and leaves little flexibility for reusing core code fragments; and 2) performance bottlenecks arise when handling extensive input data and complex computation, which call for better parallelism and scalability. In this work, we focus on how to manage the workflow in a notebook seamlessly. We 1) encapsulate reusable cells as RESTful services and containerize them as portable components, 2) provide a composition tool for describing the workflow logic of those reusable components, and 3) automate their execution on remote cloud infrastructure. Empirically, we validate the solution's usability via a use case from the Ecology and Earth Science domain that illustrates the processing of massive Light Detection and Ranging (LiDAR) data. The demonstration and analysis show that our method is feasible but needs further improvement, especially in integrating distributed workflow scheduling, automatic deployment, and execution, before it can develop into a mature approach.
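
To make the first step concrete, the following is a minimal sketch of what exposing one notebook cell as a RESTful service could look like. It is not the authors' implementation, which this abstract does not describe. The cell function `normalize_heights`, the `/run` endpoint, and the JSON payload shape are all hypothetical, and only the Python standard library is used.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


def normalize_heights(heights):
    """Hypothetical notebook cell body: shift heights so the minimum is zero."""
    lowest = min(heights)
    return [h - lowest for h in heights]


class CellHandler(BaseHTTPRequestHandler):
    """Exposes the cell as POST /run with a JSON body {"heights": [...]}.

    The endpoint name and payload shape are assumptions for illustration.
    """

    def do_POST(self):
        if self.path != "/run":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        result = normalize_heights(payload["heights"])
        body = json.dumps({"result": result}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # suppress per-request logging in this sketch


def call_cell(port, heights):
    """Client side: what a downstream workflow step would do to invoke the cell."""
    conn = HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request(
            "POST",
            "/run",
            body=json.dumps({"heights": heights}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        response = conn.getresponse()
        return json.loads(response.read())["result"]
    finally:
        conn.close()


def demo():
    """Start the service on a free local port, call it once, then stop it."""
    server = HTTPServer(("127.0.0.1", 0), CellHandler)  # port 0: OS picks a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return call_cell(server.server_port, [12.5, 10.0, 11.0])
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())
```

Once a cell sits behind an HTTP interface like this, it no longer depends on the notebook's cell order, so a composition tool can call it from any workflow, and packaging the service in a container would be a separate step not shown here.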
Source journal: Data Intelligence (Computer Science, Information Systems). CiteScore: 6.50. Self-citation rate: 15.40%. Articles published: 40. Review time: 8 weeks.