Engineering Data Processing Workflows

IF 3.3 | Zone 4, Computer Science | JCR Q2, Computer Science, Software Engineering | IEEE Software | Pub Date: 2024-06-04 | DOI: 10.1109/ms.2024.3385665
Diomidis Spinellis
Citations: 0

Abstract

Effective data processing workflows are crucial in data science, business analytics, and machine learning. Domain-specific tools can be invaluable, but often custom workflows are needed. Key to their success is splitting data and tasks into manageable chunks to enhance reliability, troubleshooting, and parallelization. Avoid monolithic programs; instead, favor modular designs that simplify data management and processing. Utilizing tools like xargs and GNU parallel can leverage multiple cores or hosts efficiently. Logging and documenting your workflow are essential for monitoring progress and understanding the process. Handling data subsets allows for quicker feedback and testing. Prepare for invalid data and system failures by designing processes that can gracefully manage exceptions and ensure results are reproducible and incremental, avoiding over-engineering. Simplify where possible, leveraging powerful, mature Unix tools and focusing optimization efforts on parts of the code responsible for the bulk of runtime costs. Adhere to software engineering practices to maintain the quality and integrity of your workflow, ensuring it remains a reliable asset to your organization.
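
The advice above on chunking, parallel execution, logging, invalid data, and incremental results can be illustrated with a minimal Python sketch. It is not taken from the article: the function names (`split_into_chunks`, `process_chunk`, `run_workflow`), the one-JSON-file-per-chunk output layout, and the squaring step that stands in for real work are all assumptions made for illustration. A thread pool stands in for tools such as xargs or GNU parallel; for CPU-bound work a process pool or those external tools would be the closer analogue.

```python
import json
import logging
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("workflow")


def split_into_chunks(records, chunk_size):
    """Yield successive lists holding at most chunk_size records."""
    for start in range(0, len(records), chunk_size):
        yield records[start:start + chunk_size]


def process_chunk(chunk_id, chunk, out_dir):
    """Process one chunk and write its results to a dedicated file.

    Hypothetical processing step: square each numeric record.
    Records that cannot be parsed are set aside instead of
    aborting the whole run.
    """
    out_path = out_dir / f"chunk-{chunk_id:04d}.json"
    if out_path.exists():
        # Incremental behavior: finished chunks are not recomputed.
        log.info("chunk %d: output exists, skipping", chunk_id)
        return out_path

    results, rejected = [], []
    for raw in chunk:
        try:
            results.append(float(raw) ** 2)
        except (TypeError, ValueError):
            rejected.append(raw)

    # Write to a temporary name, then rename, so that an interrupted
    # run never leaves a half-written file that looks complete.
    tmp_path = out_path.with_suffix(".tmp")
    tmp_path.write_text(json.dumps({"results": results, "rejected": rejected}))
    tmp_path.replace(out_path)
    log.info("chunk %d: %d processed, %d rejected",
             chunk_id, len(results), len(rejected))
    return out_path


def run_workflow(records, out_dir, chunk_size=3, workers=4):
    """Split records into chunks and process the chunks concurrently."""
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks = list(split_into_chunks(records, chunk_size))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns the output paths in chunk order.
        return list(pool.map(lambda item: process_chunk(*item, out_dir),
                             enumerate(chunks)))
```

Calling `run_workflow(records, Path("out"))` on a small subset of the data first gives quick feedback before a full run. Running it a second time with the same arguments skips every chunk whose output file already exists, so a run that failed partway can be resumed without redoing completed work.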
Source journal
IEEE Software (Engineering and Technology, Computer Science: Software Engineering)
CiteScore: 5.50
Self-citation rate: 6.10%
Articles published: 182
Review time: 6-12 weeks
Journal description: IEEE Software delivers reliable, useful, leading-edge software development information to keep engineers and managers abreast of rapid technology change. Its mission is to build the community of leading software practitioners. The authority on translating software theory into practice, this magazine positions itself between pure research and pure practice, transferring ideas, methods, and experiences among researchers and engineers. Peer-reviewed articles and columns by seasoned practitioners illuminate all aspects of the industry, including process improvement, project management, development tools, software maintenance, Web applications and opportunities, testing, and usability. The magazine's readers specify, design, document, test, maintain, purchase, engineer, sell, teach, research, and manage the production of software or systems that include software. IEEE Software welcomes articles describing how software is developed in specific companies, laboratories, and university environments as well as articles describing new tools, current trends, and past projects' limitations and failures as well as successes. Sample topics include geographically distributed development; software architectures; program and system debugging and testing; the education of software professionals; requirements, design, development, testing, and management methodologies; performance measurement and evaluation; standards; program and system reliability, security, and verification; programming environments; languages and language-related issues; Web-based development; usability; and software-related social and legal issues.