Anum Yasmin, Wasi Haider, Ali Daud, Ameena T Banjar
{"title":"Machine learning based software effort estimation using development-centric features for crowdsourcing platform","authors":"Anum Yasmin, Wasi Haider, Ali Daud, Ameena T Banjar","doi":"10.3233/ida-237366","DOIUrl":null,"url":null,"abstract":"Crowd-Sourced software development (CSSD) is getting a good deal of attention from the software and research community in recent times. One of the key challenges faced by CSSD platforms is the task selection mechanism which in practice, contains no intelligent scheme. Rather, rule-of-thumb or intuition strategies are employed, leading to biasness and subjectivity. Effort considerations on crowdsourced tasks can offer good foundation for task selection criteria but are not much investigated. Software development effort estimation (SDEE) is quite prevalent domain in software engineering but only investigated for in-house development. For open-sourced or crowdsourced platforms, it is rarely explored. Moreover, Machine learning (ML) techniques are overpowering SDEE with a claim to provide more accurate estimation results. This work aims to conjoin ML-based SDEE to analyze development effort measures on CSSD platform. The purpose is to discover development-oriented features for crowdsourced tasks and analyze performance of ML techniques to find best estimation model on CSSD dataset. TopCoder is selected as target CSSD platform for the study. TopCoder’s development tasks data with development-centric features are extracted, leading to statistical, regression and correlation analysis to justify features’ significance. For effort estimation, 10 ML families with 2 respective techniques are applied to get broader aspect of estimation. Five performance metrices (MSE, RMSE, MMRE, MdMRE, Pred (25) and Welch’s statistical test are incorporated to judge the worth of effort estimation model’s performance. Data analysis results show that selected features of TopCoder pertain reasonable model significance, regression, and correlation measures. Findings of ML effort estimation depicted that best results for TopCoder dataset can be acquired by linear, non-linear regression and SVM family models. To conclude, the study identified the most relevant development features for CSSD platform, confirmed by in-depth data analysis. This reflects careful selection of effort estimation features to offer good basis of accurate ML estimate.","PeriodicalId":50355,"journal":{"name":"Intelligent Data Analysis","volume":"29 1","pages":""},"PeriodicalIF":0.9000,"publicationDate":"2023-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Intelligent Data Analysis","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.3233/ida-237366","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
Abstract
Crowd-Sourced software development (CSSD) is getting a good deal of attention from the software and research community in recent times. One of the key challenges faced by CSSD platforms is the task selection mechanism which in practice, contains no intelligent scheme. Rather, rule-of-thumb or intuition strategies are employed, leading to biasness and subjectivity. Effort considerations on crowdsourced tasks can offer good foundation for task selection criteria but are not much investigated. Software development effort estimation (SDEE) is quite prevalent domain in software engineering but only investigated for in-house development. For open-sourced or crowdsourced platforms, it is rarely explored. Moreover, Machine learning (ML) techniques are overpowering SDEE with a claim to provide more accurate estimation results. This work aims to conjoin ML-based SDEE to analyze development effort measures on CSSD platform. The purpose is to discover development-oriented features for crowdsourced tasks and analyze performance of ML techniques to find best estimation model on CSSD dataset. TopCoder is selected as target CSSD platform for the study. TopCoder’s development tasks data with development-centric features are extracted, leading to statistical, regression and correlation analysis to justify features’ significance. For effort estimation, 10 ML families with 2 respective techniques are applied to get broader aspect of estimation. Five performance metrices (MSE, RMSE, MMRE, MdMRE, Pred (25) and Welch’s statistical test are incorporated to judge the worth of effort estimation model’s performance. Data analysis results show that selected features of TopCoder pertain reasonable model significance, regression, and correlation measures. Findings of ML effort estimation depicted that best results for TopCoder dataset can be acquired by linear, non-linear regression and SVM family models. To conclude, the study identified the most relevant development features for CSSD platform, confirmed by in-depth data analysis. This reflects careful selection of effort estimation features to offer good basis of accurate ML estimate.
期刊介绍:
Intelligent Data Analysis provides a forum for the examination of issues related to the research and applications of Artificial Intelligence techniques in data analysis across a variety of disciplines. These techniques include (but are not limited to): all areas of data visualization, data pre-processing (fusion, editing, transformation, filtering, sampling), data engineering, database mining techniques, tools and applications, use of domain knowledge in data analysis, big data applications, evolutionary algorithms, machine learning, neural nets, fuzzy logic, statistical pattern recognition, knowledge filtering, and post-processing. In particular, papers are preferred that discuss development of new AI related data analysis architectures, methodologies, and techniques and their applications to various domains.