Machine learning based software effort estimation using development-centric features for crowdsourcing platform

Intelligent Data Analysis · IF 0.8 · Region 4 (Computer Science) · JCR Q4 (Computer Science, Artificial Intelligence) · Pub Date: 2023-11-30 · DOI: 10.3233/ida-237366

Anum Yasmin, Wasi Haider, Ali Daud, Ameena T Banjar

Citations: 0

Abstract

Crowdsourced software development (CSSD) has attracted considerable attention from the software and research communities in recent years. One of the key challenges faced by CSSD platforms is the task selection mechanism, which in practice involves no intelligent scheme. Instead, rule-of-thumb or intuition-based strategies are employed, leading to bias and subjectivity. Effort considerations for crowdsourced tasks can offer a good foundation for task selection criteria but have received little investigation. Software development effort estimation (SDEE) is a well-established domain in software engineering, but it has been investigated largely for in-house development; for open-source or crowdsourced platforms, it is rarely explored. Moreover, machine learning (ML) techniques increasingly dominate SDEE, with claims of more accurate estimation results. This work applies ML-based SDEE to analyze development effort measures on a CSSD platform. The purpose is to discover development-oriented features for crowdsourced tasks and to analyze the performance of ML techniques in order to find the best estimation model on a CSSD dataset. TopCoder is selected as the target CSSD platform for the study. TopCoder's development task data with development-centric features are extracted, followed by statistical, regression, and correlation analysis to justify the features' significance. For effort estimation, 10 ML families with 2 techniques each are applied to obtain a broader view of estimation. Five performance metrics (MSE, RMSE, MMRE, MdMRE, and Pred(25)) and Welch's statistical test are used to judge the performance of the effort estimation models. Data analysis results show that the selected TopCoder features exhibit reasonable model significance, regression, and correlation measures. The ML effort estimation findings show that the best results for the TopCoder dataset are obtained by linear regression, non-linear regression, and SVM family models. To conclude, the study identified the most relevant development features for a CSSD platform, confirmed by in-depth data analysis. This indicates that careful selection of effort estimation features offers a good basis for accurate ML estimates.
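The five metrics named in the abstract have standard definitions in the effort estimation literature, and a minimal sketch of them may help make the evaluation concrete. The code below is an illustration only, not the authors' implementation: the function names `effort_metrics` and `welch_t` are hypothetical, Pred(25) is returned as a fraction rather than a percentage, and applying Welch's test to the absolute errors of two models is an assumption about how such a comparison is typically done, since the abstract does not say what the test is applied to.

```python
import math
import statistics


def effort_metrics(actual, predicted, level=0.25):
    """Return MSE, RMSE, MMRE, MdMRE and Pred(level) for paired effort values.

    Assumes every actual effort value is strictly positive, because the
    magnitude of relative error (MRE) divides by the actual value.
    """
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("actual and predicted must be non-empty and equal length")
    errors = [a - p for a, p in zip(actual, predicted)]
    # MRE_i = |actual_i - predicted_i| / actual_i
    mre = [abs(e) / a for e, a in zip(errors, actual)]
    mse = statistics.fmean([e * e for e in errors])
    return {
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MMRE": statistics.fmean(mre),      # mean of the MRE values
        "MdMRE": statistics.median(mre),    # median of the MRE values
        # Fraction of estimates whose MRE is within `level` (0.25 for Pred(25)).
        "Pred": sum(1 for m in mre if m <= level) / len(mre),
    }


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Each sample needs at least two values. Turning the statistic into a
    p-value requires a t-distribution CDF, which is left out here.
    """
    va = statistics.variance(sample_a) / len(sample_a)
    vb = statistics.variance(sample_b) / len(sample_b)
    t = (statistics.fmean(sample_a) - statistics.fmean(sample_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (
        va ** 2 / (len(sample_a) - 1) + vb ** 2 / (len(sample_b) - 1)
    )
    return t, df


if __name__ == "__main__":
    # Made-up effort values purely for illustration; not data from the study.
    actual = [100, 200, 400]
    model_a = [110, 140, 400]
    model_b = [150, 260, 300]
    print(effort_metrics(actual, model_a))
    abs_err_a = [abs(a - p) for a, p in zip(actual, model_a)]
    abs_err_b = [abs(a - p) for a, p in zip(actual, model_b)]
    print(welch_t(abs_err_a, abs_err_b))
```

Lower MSE, RMSE, MMRE, and MdMRE indicate better estimates, while a higher Pred(25) is better, so a model comparison of the kind described would rank techniques on these values and use the Welch statistic to check whether the difference between two models' errors is more than noise.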

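Source journal

Intelligent Data Analysis (Engineering & Technology; Computer Science: Artificial Intelligence)

CiteScore

2.20

Self-citation rate

5.90%

Articles published

Review time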

3.3 months

Journal description: Intelligent Data Analysis provides a forum for the examination of issues related to the research and applications of Artificial Intelligence techniques in data analysis across a variety of disciplines. These techniques include (but are not limited to): all areas of data visualization, data pre-processing (fusion, editing, transformation, filtering, sampling), data engineering, database mining techniques, tools and applications, use of domain knowledge in data analysis, big data applications, evolutionary algorithms, machine learning, neural nets, fuzzy logic, statistical pattern recognition, knowledge filtering, and post-processing. In particular, papers are preferred that discuss the development of new AI-related data analysis architectures, methodologies, and techniques and their applications to various domains.