{"title":"Data preparation for Deep Learning based Code Smell Detection: A systematic literature review","authors":"Fengji Zhang , Zexian Zhang , Jacky Wai Keung , Xiangru Tang , Zhen Yang , Xiao Yu , Wenhua Hu","doi":"10.1016/j.jss.2024.112131","DOIUrl":null,"url":null,"abstract":"<div><p><u>C</u>ode <u>S</u>mell <u>D</u>etection (CSD) plays a crucial role in improving software quality and maintainability. And <u>D</u>eep <u>L</u>earning (DL) techniques have emerged as a promising approach for CSD due to their superior performance. However, the effectiveness of DL-based CSD methods heavily relies on the quality of the training data. Despite its importance, little attention has been paid to analyzing the data preparation process. This systematic literature review analyzes the data preparation techniques used in DL-based CSD methods. We identify 36 relevant papers published by December 2023 and provide a thorough analysis of the critical considerations in constructing CSD datasets, including data requirements, collection, labeling, and cleaning. We also summarize seven primary challenges and corresponding solutions in the literature. Finally, we offer actionable recommendations for preparing and accessing high-quality CSD data, emphasizing the importance of data diversity, standardization, and accessibility. This survey provides valuable insights for researchers and practitioners to harness the full potential of DL techniques in CSD.</p></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":null,"pages":null},"PeriodicalIF":3.7000,"publicationDate":"2024-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Systems and Software","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0164121224001766","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
引用次数: 0
Abstract
Code Smell Detection (CSD) plays a crucial role in improving software quality and maintainability. And Deep Learning (DL) techniques have emerged as a promising approach for CSD due to their superior performance. However, the effectiveness of DL-based CSD methods heavily relies on the quality of the training data. Despite its importance, little attention has been paid to analyzing the data preparation process. This systematic literature review analyzes the data preparation techniques used in DL-based CSD methods. We identify 36 relevant papers published by December 2023 and provide a thorough analysis of the critical considerations in constructing CSD datasets, including data requirements, collection, labeling, and cleaning. We also summarize seven primary challenges and corresponding solutions in the literature. Finally, we offer actionable recommendations for preparing and accessing high-quality CSD data, emphasizing the importance of data diversity, standardization, and accessibility. This survey provides valuable insights for researchers and practitioners to harness the full potential of DL techniques in CSD.
期刊介绍:
The Journal of Systems and Software publishes papers covering all aspects of software engineering and related hardware-software-systems issues. All articles should include a validation of the idea presented, e.g. through case studies, experiments, or systematic comparisons with other approaches already in practice. Topics of interest include, but are not limited to:
• Methods and tools for, and empirical studies on, software requirements, design, architecture, verification and validation, maintenance and evolution
• Agile, model-driven, service-oriented, open source and global software development
• Approaches for mobile, multiprocessing, real-time, distributed, cloud-based, dependable and virtualized systems
• Human factors and management concerns of software development
• Data management and big data issues of software systems
• Metrics and evaluation, data mining of software development resources
• Business and economic aspects of software development processes
The journal welcomes state-of-the-art surveys and reports of practical experience for all of these topics.