We performed a comparative analysis of code generation model performance, evaluating generated code with common NLP metrics and, in parallel, with a test-based evaluation. The investigation was carried out in the context of question answering with code (the text-to-code problem) and aimed to check the applicability of both approaches for evaluating generated code in a fully automatic manner. We applied the pretrained CodeGen and GPTNeo models to a question-answering problem over a Stack Overflow-based corpus (APIzation). For the test-based evaluation, industrial test-generation solutions (Machinet, UTBot) were used to provide automatically generated tests. The analysis showed that performance evaluation based solely on NLP metrics or solely on tests gives a rather limited assessment of generated code quality: we observed predictions with both high and low NLP metric values that pass the tests, as well as ones that do not. Discussing the early results of our empirical study in this paper, we believe that combining both approaches may broaden the options for building, evaluating, and training code generation models.
{"title":"Test-based and metric-based evaluation of code generation models for practical question answering","authors":"Sergey Kovalchuk, Dmitriy Fedrushkov, Vadim Lomshakov, Artem Aliev","doi":"10.1109/ICCQ57276.2023.10114665","DOIUrl":"https://doi.org/10.1109/ICCQ57276.2023.10114665","url":null,"abstract":"We performed a comparative analysis of code generation model performance with evaluation using common NLP metrics in comparison to a test-based evaluation. The investigation was performed in the context of question answering with code (test-to-code problem) and was aimed at applicability checking both ways for generated code evaluation in a fully automatic manner. We used CodeGen and GPTNeo pretrained models applied to a problem of question answering using Stack Overflow-based corpus (APIzation). For test-based evaluation, industrial test-generation solutions (Machinet, UTBot) were used for providing automatically generated tests. The analysis showed that the performance evaluation based solely on NLP metrics or on tests provides a rather limited assessment of generated code quality. We see the evidence that predictions with both high and low NLP metrics exist that pass and don't pass tests. With the early results of our empirical study being discussed in this paper, we believe that the combination of both approaches may increase possible ways for building, evaluating, and training code generation models.","PeriodicalId":318687,"journal":{"name":"2023 International Conference on Code Quality (ICCQ)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114885420","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Understanding Software Performance Challenges: An Empirical Study on Stack Overflow
Pub Date : 2023-04-22 DOI: 10.1109/ICCQ57276.2023.10114662
Deema Adeeb Al Shoaibi, Mohamed Wiem Mkaouer
Performance is a quality aspect that describes how well the software performs. Any performance degradation further affects other quality aspects, such as usability. Software developers continuously test their code to ensure that additions or changes do not break existing functionality or degrade quality, and they set up strategies to detect, locate, and fix regressions when needed. In this paper, we provide an exploratory study of the challenges developers face in resolving performance regressions. The study is based on questions about performance regression posted on a technical forum. We collected 1828 questions discussing regressions in software execution time and analyzed all of them manually. The study resulted in a categorization of the challenges, and we also discuss the difficulty level of performance regression issues within the developer community. This study provides insights that help developers avoid the causes of regressions during software design and implementation.
{"title":"Understanding Software Performance Challenges an Empirical Study on Stack Overflow","authors":"Deema Adeeb Al Shoaibi, Mohamed Wiem Mkaouer","doi":"10.1109/ICCQ57276.2023.10114662","DOIUrl":"https://doi.org/10.1109/ICCQ57276.2023.10114662","url":null,"abstract":"Performance is a quality aspect describing how the software is performing. Any performance degradation will further affect other quality aspects, such as usability. Software developers continuously conduct testing to ensure that code addition or changes do not damage existing functionalities or negatively affect the quality. Hence, developers set strategies to detect, locate and fix the regression if needed. In this paper, we provide an exploratory study on the challenges developers face in resolving performance regression. The study is based on the questions posted on a technical forum directed to performance regression. We collected 1828 questions discussing the regression of software execution time. All those questions are manually analyzed. The study resulted in a categorization of the challenges. We also discussed the difficulty level of performance regression issues within the developers community. This study provides insights to help developers during the software design and implementation to avoid regression causes.","PeriodicalId":318687,"journal":{"name":"2023 International Conference on Code Quality (ICCQ)","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128678679","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Applying Machine Learning Analysis for Software Quality Test
Pub Date : 2023-04-22 DOI: 10.1109/ICCQ57276.2023.10114664
Al Khan, R. R. Mekuria, Ruslan Isaev
One of the biggest expenses in software development is maintenance. Therefore, it is critical to understand what triggers maintenance and whether it can be predicted. Numerous studies have demonstrated that specific methods of assessing the complexity of programs can produce useful prediction models for estimating the likelihood of maintenance due to software failures. This is routinely done prior to release, and setting up such models frequently requires particular object-oriented software measurements, which developers do not always have access to. In this paper, machine learning is applied to the available data to calculate cumulative software failure levels. A technique that forecasts a software product's residual defectiveness using machine learning can be seen as a solution to the challenge of predicting residual flaws. Software metrics and defect data were extracted from a static source code repository: the metrics were computed from the static code, and defect information was gathered from the bugs reported in the repository. Using a correlation method, metrics that had no connection to the defect data were removed, which makes it possible to analyze all the data without pausing the development process. The primary issue with large, sophisticated software is that it is impossible to control everything manually, and the cost of an error can be very high; as a consequence, developers may miss errors during testing, which raises maintenance costs. The overall objective is to find a method that accurately forecasts software defects.
{"title":"Applying Machine Learning Analysis for Software Quality Test","authors":"Al Khan, R. R. Mekuria, Ruslan Isaev","doi":"10.1109/ICCQ57276.2023.10114664","DOIUrl":"https://doi.org/10.1109/ICCQ57276.2023.10114664","url":null,"abstract":"One of the biggest expense in software development is the maintenance. Therefore, it's critical to comprehend what triggers maintenance and if it may be predicted. Numerous research outputs have demonstrated that specific methods of assessing the complexity of created programs may produce useful prediction models to as-certain the possibility of maintenance due to software failures. As a routine it is performed prior to the release, and setting up the models frequently calls for certain, object-oriented software measurements. It's not always the case that software developers have access to these measurements. In this paper, machine learning is applied on the available data to calculate the cumulative software failure levels. A technique to forecast a software's residual defectiveness using machine learning can be looked into as a solution to the challenge of predicting residual flaws. Software metrics and defect data were separated out of the static source code repository. Static code is used to create software metrics, and reported bugs in the repository are used to gather defect information. By using a correlation method, metrics that had no connection to the defect data were removed. This makes it possible to analyze all the data without pausing the programming process. Large, sophisticated software's primary issue is that it is impossible to control everything manually, and the cost of an error can be quite expensive. Developers may miss errors during testing as a consequence, which will raise maintenance costs. Finding a method to accurately forecast software defects is the overall objective.","PeriodicalId":318687,"journal":{"name":"2023 International Conference on Code Quality (ICCQ)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129329566","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mutant Selection Strategies in Mutation Testing
Pub Date : 2023-04-22 DOI: 10.1109/ICCQ57276.2023.10114663
Rowland Pitts
Mutation Testing offers a powerful approach to assessing unit test set quality; however, software developers are often reluctant to embrace the technique because of the tremendous number of mutants it generates, including redundant and equivalent mutants. Researchers have sought strategies to reduce the number of mutants without reducing effectiveness, as well as ways to select more effective mutants, but no strategy has performed better than random mutant selection. Equivalent mutants, which cannot be killed, make achieving mutation adequacy difficult, so most research is conducted under the assumption that unkilled mutants are equivalent. Using 15 java.lang classes known to have truly mutation-adequate test sets, this research demonstrates that even when the number of equivalent mutants is drastically reduced, they remain a tester's largest problem, and that, apart from their presence, achieving mutation adequacy is relatively easy. It also assesses a variety of mutant selection strategies and demonstrates that, even with mutation-adequate test sets, none perform as well as random mutant selection.
{"title":"Mutant Selection Strategies in Mutation Testing","authors":"Rowland Pitts","doi":"10.1109/ICCQ57276.2023.10114663","DOIUrl":"https://doi.org/10.1109/ICCQ57276.2023.10114663","url":null,"abstract":"Mutation Testing offers a powerful approach to assessing unit test set quality; however, software developers are often reluctant to embrace the technique because of the tremendous number of mutants it generates, including redundant and equivalent mutants. Researchers have sought strategies to reduce the number of mutants without reducing effectiveness, and also ways to select more effective mutants, but no strategy has performed better than random mutant selection. Equivalent mutants, which cannot be killed, make achieving mutation adequacy difficult, so most research is conducted with the assumption that unkilled mutants are equivalent. Using 15 java.lang classes that are known to have truly mutation adequate test sets, this research demonstrates that even when the number of equivalent mutants is drastically reduced, they remain a tester's largest problem, and that apart from their presence achieving mutation adequacy is relatively easy. It also assesses a variety of mutant selection strategies and demonstrates that even with mutation adequate test sets, none perform as well as random mutant selection.","PeriodicalId":318687,"journal":{"name":"2023 International Conference on Code Quality (ICCQ)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125190890","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
What IS Code Quality: Keynote
Pub Date : 2023-04-22 DOI: 10.1109/iccq57276.2023.10114655
{"title":"What IS Code Quality: Keynote","authors":"","doi":"10.1109/iccq57276.2023.10114655","DOIUrl":"https://doi.org/10.1109/iccq57276.2023.10114655","url":null,"abstract":"","PeriodicalId":318687,"journal":{"name":"2023 International Conference on Code Quality (ICCQ)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115549562","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}