Engineering Data Processing Workflows
Diomidis Spinellis
IEEE Software, published 2024-06-04
DOI: https://doi.org/10.1109/ms.2024.3385665
Impact factor: 3.3 (JCR Q2, Computer Science, Software Engineering)
Citations: 0
Abstract
Effective data processing workflows are crucial in data science, business analytics, and machine learning. Domain-specific tools can be invaluable, but often custom workflows are needed. Key to their success is splitting data and tasks into manageable chunks to enhance reliability, troubleshooting, and parallelization. Avoid monolithic programs; instead, favor modular designs that simplify data management and processing. Utilizing tools like xargs and GNU parallel can leverage multiple cores or hosts efficiently. Logging and documenting your workflow are essential for monitoring progress and understanding the process. Handling data subsets allows for quicker feedback and testing. Prepare for invalid data and system failures by designing processes that can gracefully manage exceptions and ensure results are reproducible and incremental, avoiding over-engineering. Simplify where possible, leveraging powerful, mature Unix tools and focusing optimization efforts on parts of the code responsible for the bulk of runtime costs. Adhere to software engineering practices to maintain the quality and integrity of your workflow, ensuring it remains a reliable asset to your organization.
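As an illustration of the advice above, here is a minimal Python sketch of a chunked, restartable workflow. It is not taken from the article: the function names (`split_into_chunks`, `process_chunk`, `run_workflow`), the one-result-file-per-chunk layout, and the toy task of summing integers are all assumptions made for the example. It shows splitting the data into chunks, logging progress, skipping invalid records instead of aborting, and leaving finished chunks alone on a rerun. A thread pool keeps the sketch simple; for CPU-bound work, `concurrent.futures.ProcessPoolExecutor` offers the same interface and can use multiple cores.

```python
import logging
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("workflow")


def split_into_chunks(records, chunk_size):
    """Split a list of records into consecutive chunks of at most chunk_size."""
    return [records[i:i + chunk_size]
            for i in range(0, len(records), chunk_size)]


def process_chunk(index, chunk, out_dir):
    """Sum the valid integers in one chunk and store the result in its own file.

    A chunk whose result file already exists is skipped, so an interrupted
    run can be restarted without redoing finished work.
    """
    out_path = Path(out_dir) / f"chunk-{index:04d}.txt"
    if out_path.exists():
        log.info("chunk %d already done, skipping", index)
        return out_path
    total = 0
    for record in chunk:
        try:
            total += int(record)
        except ValueError:
            # Invalid input is logged and skipped rather than aborting the run.
            log.warning("chunk %d: skipping invalid record %r", index, record)
    # Write under a temporary name, then rename, so a crash never leaves
    # a half-written file that looks like a finished result.
    tmp_path = out_path.with_suffix(".tmp")
    tmp_path.write_text(f"{total}\n")
    tmp_path.replace(out_path)
    log.info("chunk %d done: total=%d", index, total)
    return out_path


def run_workflow(records, out_dir, chunk_size=3, workers=4):
    """Process all chunks concurrently and combine the per-chunk results."""
    chunks = split_into_chunks(records, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(process_chunk, i, chunk, out_dir)
                   for i, chunk in enumerate(chunks)]
        paths = [future.result() for future in futures]
    return sum(int(path.read_text()) for path in paths)


if __name__ == "__main__":
    # Hypothetical sample data containing two invalid records.
    sample = ["1", "2", "oops", "4", "5", "", "7"]
    with tempfile.TemporaryDirectory() as out_dir:
        print(run_workflow(sample, out_dir))
```

Keeping one result file per chunk is a deliberate simplification: it makes progress visible on disk, lets a failed chunk be rerun in isolation, and allows testing on a small subset of the data by passing fewer records.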
About the journal:
IEEE Software delivers reliable, useful, leading-edge software development information to keep engineers and managers abreast of rapid technology change. Its mission is to build the community of leading software practitioners. The authority on translating software theory into practice, this magazine positions itself between pure research and pure practice, transferring ideas, methods, and experiences among researchers and engineers. Peer-reviewed articles and columns by seasoned practitioners illuminate all aspects of the industry, including process improvement, project management, development tools, software maintenance, Web applications and opportunities, testing, and usability. The magazine's readers specify, design, document, test, maintain, purchase, engineer, sell, teach, research, and manage the production of software or systems that include software. IEEE Software welcomes articles describing how software is developed in specific companies, laboratories, and university environments, as well as articles describing new tools, current trends, and the limitations, failures, and successes of past projects. Sample topics include geographically distributed development; software architectures; program and system debugging and testing; the education of software professionals; requirements, design, development, testing, and management methodologies; performance measurement and evaluation; standards; program and system reliability, security, and verification; programming environments; languages and language-related issues; Web-based development; usability; and software-related social and legal issues.