Title: Engineering Data Processing Workflows
Author: Diomidis Spinellis
Journal: IEEE Software
Publication date: 2024-06-04
DOI: https://doi.org/10.1109/ms.2024.3385665
Citation count: 0
Abstract
Effective data processing workflows are crucial in data science, business analytics, and machine learning. Domain-specific tools can be invaluable, but often custom workflows are needed. Key to their success is splitting data and tasks into manageable chunks to enhance reliability, troubleshooting, and parallelization. Avoid monolithic programs; instead, favor modular designs that simplify data management and processing. Utilizing tools like xargs and GNU parallel can leverage multiple cores or hosts efficiently. Logging and documenting your workflow are essential for monitoring progress and understanding the process. Handling data subsets allows for quicker feedback and testing. Prepare for invalid data and system failures by designing processes that can gracefully manage exceptions and ensure results are reproducible and incremental, avoiding over-engineering. Simplify where possible, leveraging powerful, mature Unix tools and focusing optimization efforts on parts of the code responsible for the bulk of runtime costs. Adhere to software engineering practices to maintain the quality and integrity of your workflow, ensuring it remains a reliable asset to your organization.
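As a rough illustration of several practices the abstract names (chunking, parallel execution, logging, tolerance of invalid data, and incremental re-runs), the following Python sketch may help. It is not taken from the article: the function names, the one-result-file-per-chunk layout, and the toy task of summing integers are all assumptions made for the example. Threads stand in for the parallel workers here; for CPU-bound work, separate processes or the tools the abstract mentions (xargs, GNU parallel) would likely be the better fit.

```python
import logging
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Log progress with timestamps so a long run can be monitored.
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("workflow")


def split_into_chunks(records, chunk_size):
    """Split a list of records into lists of at most chunk_size items."""
    return [records[i:i + chunk_size]
            for i in range(0, len(records), chunk_size)]


def process_chunk(chunk):
    """Toy task (assumed): sum valid integers, count and skip invalid records."""
    total, invalid = 0, 0
    for record in chunk:
        try:
            total += int(record)
        except ValueError:
            invalid += 1
    return total, invalid


def run_chunk(index, chunk, out_dir):
    """Process one chunk unless its result file already exists."""
    out_path = Path(out_dir) / f"chunk-{index:04d}.txt"
    if out_path.exists():
        log.info("chunk %d: result exists, skipping", index)
        return False
    total, invalid = process_chunk(chunk)
    # Write to a temporary name, then rename, so that an interrupted
    # run never leaves a partial file that looks like a finished result.
    tmp_path = out_path.with_suffix(".tmp")
    tmp_path.write_text(f"{total} {invalid}\n")
    tmp_path.replace(out_path)
    log.info("chunk %d: total=%d invalid=%d", index, total, invalid)
    return True


def run_workflow(records, out_dir, chunk_size=3, workers=4):
    """Process all chunks in parallel; return how many were newly processed."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    chunks = split_into_chunks(records, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        done = list(pool.map(lambda item: run_chunk(item[0], item[1], out_dir),
                             enumerate(chunks)))
    return sum(done)


def combine_results(out_dir):
    """Merge the per-chunk result files into overall (total, invalid) counts."""
    total = invalid = 0
    for path in sorted(Path(out_dir).glob("chunk-*.txt")):
        chunk_total, chunk_invalid = path.read_text().split()
        total += int(chunk_total)
        invalid += int(chunk_invalid)
    return total, invalid
```

Calling `run_workflow` a second time on the same output directory skips every chunk whose result file is already present, so a run interrupted by a failure can resume without repeating finished work. Passing a small slice of the records is one simple way to get quick feedback before a full run.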
About the journal:
IEEE Software delivers reliable, useful, leading-edge software development information to keep engineers and managers abreast of rapid technology change. Its mission is to build the community of leading software practitioners. The authority on translating software theory into practice, this magazine positions itself between pure research and pure practice, transferring ideas, methods, and experiences among researchers and engineers. Peer-reviewed articles and columns by seasoned practitioners illuminate all aspects of the industry, including process improvement, project management, development tools, software maintenance, Web applications and opportunities, testing, and usability. The magazine's readers specify, design, document, test, maintain, purchase, engineer, sell, teach, research, and manage the production of software or systems that include software. IEEE Software welcomes articles describing how software is developed in specific companies, laboratories, and university environments as well as articles describing new tools, current trends, and past projects' limitations and failures as well as successes. Sample topics include geographically distributed development; software architectures; program and system debugging and testing; the education of software professionals; requirements, design, development, testing, and management methodologies; performance measurement and evaluation; standards; program and system reliability, security, and verification; programming environments; languages and language-related issues; Web-based development; usability; and software-related social and legal issues.