Can GPT-O1 Kill All Bugs? (arXiv:2409.10033, 2024-09-16)
Haichuan Hu, Ye Shang, Guolin Xu, Congqing He, Quanjun Zhang
ChatGPT has long been shown to be effective for automatic program repair (APR). With continuous iterations and upgrades, its repair performance has reached state-of-the-art levels. However, few works compare the effectiveness of different ChatGPT versions on APR. In this work, we evaluate the latest O1 models (O1-preview and O1-mini), ChatGPT-4o, and historical versions of ChatGPT on APR. We study the improvements of the O1 models over traditional ChatGPT from multiple perspectives (repair success rate, repair cost, and behavior patterns), and find that O1's repair capability exceeds that of traditional ChatGPT, successfully fixing all 40 bugs in the benchmark. Our work can serve as a reference for further in-depth exploration of ChatGPT's applications in APR.
{"title":"Can GPT-O1 Kill All Bugs?","authors":"Haichuan Hu, Ye Shang, Guolin Xu, Congqing He, Quanjun Zhang","doi":"arxiv-2409.10033","DOIUrl":"https://doi.org/arxiv-2409.10033","url":null,"abstract":"ChatGPT has long been proven to be effective in automatic program repair\u0000(APR). With the continuous iterations and upgrades of the ChatGPT version, its\u0000performance in terms of fixes has already reached state-of-the-art levels.\u0000However, there are few works comparing the effectiveness and variations of\u0000different versions of ChatGPT on APR. In this work, we evaluate the performance\u0000of the latest version of ChatGPT (O1-preview and O1-mini), ChatGPT-4o, and\u0000historical version of ChatGPT on APR. We study the improvements of the O1 model\u0000over traditional ChatGPT in terms of APR from multiple perspectives (repair\u0000success rate, repair cost, behavior patterns), and find that O1's repair\u0000capability exceeds that of traditional ChatGPT, successfully fixing all 40 bugs\u0000in the benchmark. Our work can serve as a reference for further in-depth\u0000exploration of the applications of ChatGPT in APR.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261346","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Do Test and Environmental Complexity Increase Flakiness? An Empirical Study of SAP HANA (arXiv:2409.10062, 2024-09-16)
Alexander Berndt, Thomas Bach, Sebastian Baltes
Background: Test flakiness is a major problem in the software industry. Flaky tests fail seemingly at random, without changes to the code, and thus impede continuous integration (CI). Some researchers argue that all tests can be considered flaky and that tests only differ in their frequency of flaky failures.
Aims: With the goal of developing mitigation strategies to reduce the negative impact of test flakiness, we study characteristics of tests and the test environment that potentially impact test flakiness.
Method: We construct two datasets based on SAP HANA's test results over a 12-week period: one based on production data, the other based on targeted test executions from a dedicated flakiness experiment. We conduct correlation analysis for test and test environment characteristics with respect to their influence on the frequency of flaky test failures.
Results: In our study, the average test execution time had the strongest positive correlation with the test flakiness rate (r = 0.79), which confirms previous studies. Potential reasons for higher flakiness include the larger test scope of long-running tests or test executions on a slower test infrastructure. Interestingly, the load on the testing infrastructure was not correlated with test flakiness. The relationship between test flakiness and the resources required for test execution is inconclusive.
Conclusions: Based on our findings, we conclude that splitting long-running tests can be an important measure for practitioners to cope with test flakiness, as it enables parallelization of test executions and also reduces the cost of re-executions. This effectively decreases the negative effects of test flakiness in complex testing environments. However, when splitting long-running tests, practitioners need to consider the potential test setup overhead of test splits.
{"title":"Do Test and Environmental Complexity Increase Flakiness? An Empirical Study of SAP HANA","authors":"Alexander Berndt, Thomas Bach, Sebastian Baltes","doi":"arxiv-2409.10062","DOIUrl":"https://doi.org/arxiv-2409.10062","url":null,"abstract":"Background: Test flakiness is a major problem in the software industry. Flaky\u0000tests fail seemingly at random without changes to the code and thus impede\u0000continuous integration (CI). Some researchers argue that all tests can be\u0000considered flaky and that tests only differ in their frequency of flaky\u0000failures. Aims: With the goal of developing mitigation strategies to reduce the\u0000negative impact of test flakiness, we study characteristics of tests and the\u0000test environment that potentially impact test flakiness. Method: We construct two datasets based on SAP HANA's test results over a\u000012-week period: one based on production data, the other based on targeted test\u0000executions from a dedicated flakiness experiment. We conduct correlation\u0000analysis for test and test environment characteristics with respect to their\u0000influence on the frequency of flaky test failures. Results: In our study, the average test execution time had the strongest\u0000positive correlation with the test flakiness rate (r = 0.79), which confirms\u0000previous studies. Potential reasons for higher flakiness include the larger\u0000test scope of long-running tests or test executions on a slower test\u0000infrastructure. Interestingly, the load on the testing infrastructure was not\u0000correlated with test flakiness. The relationship between test flakiness and\u0000required resources for test execution is inconclusive. Conclusions: Based on our findings, we conclude that splitting long-running\u0000tests can be an important measure for practitioners to cope with test\u0000flakiness, as it enables parallelization of test executions and also reduces\u0000the cost of re-executions. This effectively decreases the negative effects of\u0000test flakiness in complex testing environments. However, when splitting\u0000long-running tests, practitioners need to consider the potential test setup\u0000overhead of test splits.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"47 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LeGEND: A Top-Down Approach to Scenario Generation of Autonomous Driving Systems Assisted by Large Language Models (arXiv:2409.10066, 2024-09-16)
Shuncheng Tang, Zhenya Zhang, Jixiang Zhou, Lei Lei, Yuan Zhou, Yinxing Xue
Autonomous driving systems (ADS) are safety-critical and require comprehensive testing before their deployment on public roads. While existing testing approaches primarily aim at the criticality of scenarios, they often overlook the diversity of the generated scenarios, which is also important for exposing system defects in different aspects. To bridge this gap, we propose LeGEND, which features a top-down approach to scenario generation: it starts with abstract functional scenarios and then steps down to logical and concrete scenarios, so that scenario diversity can be controlled at the functional level. However, unlike logical scenarios, which can be formally described, functional scenarios are often documented in natural language (e.g., accident reports) and thus cannot be precisely parsed and processed by computers. To tackle this issue, LeGEND leverages recent advances in large language models (LLMs) to transform textual functional scenarios into formal logical scenarios. To mitigate the distraction of useless information in functional scenario descriptions, we devise a two-phase transformation that uses an intermediate language; consequently, we adopt two LLMs in LeGEND, one for extracting information from functional scenarios and the other for converting the extracted information into formal logical scenarios. We experimentally evaluate LeGEND on Apollo, an industry-grade ADS from Baidu. Evaluation results show that LeGEND can effectively identify critical scenarios and, compared to baseline approaches, exhibits evident superiority in the diversity of generated scenarios. Moreover, we also demonstrate the advantages of our two-phase transformation framework and the accuracy of the adopted LLMs.
{"title":"LeGEND: A Top-Down Approach to Scenario Generation of Autonomous Driving Systems Assisted by Large Language Models","authors":"Shuncheng Tang, Zhenya Zhang, Jixiang Zhou, Lei Lei, Yuan Zhou, Yinxing Xue","doi":"arxiv-2409.10066","DOIUrl":"https://doi.org/arxiv-2409.10066","url":null,"abstract":"Autonomous driving systems (ADS) are safety-critical and require\u0000comprehensive testing before their deployment on public roads. While existing\u0000testing approaches primarily aim at the criticality of scenarios, they often\u0000overlook the diversity of the generated scenarios that is also important to\u0000reflect system defects in different aspects. To bridge the gap, we propose\u0000LeGEND, that features a top-down fashion of scenario generation: it starts with\u0000abstract functional scenarios, and then steps downwards to logical and concrete\u0000scenarios, such that scenario diversity can be controlled at the functional\u0000level. However, unlike logical scenarios that can be formally described,\u0000functional scenarios are often documented in natural languages (e.g., accident\u0000reports) and thus cannot be precisely parsed and processed by computers. To\u0000tackle that issue, LeGEND leverages the recent advances of large language\u0000models (LLMs) to transform textual functional scenarios to formal logical\u0000scenarios. To mitigate the distraction of useless information in functional\u0000scenario description, we devise a two-phase transformation that features the\u0000use of an intermediate language; consequently, we adopt two LLMs in LeGEND, one\u0000for extracting information from functional scenarios, the other for converting\u0000the extracted information to formal logical scenarios. We experimentally\u0000evaluate LeGEND on Apollo, an industry-grade ADS from Baidu. Evaluation results\u0000show that LeGEND can effectively identify critical scenarios, and compared to\u0000baseline approaches, LeGEND exhibits evident superiority in diversity of\u0000generated scenarios. Moreover, we also demonstrate the advantages of our\u0000two-phase transformation framework, and the accuracy of the adopted LLMs.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"93 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261344","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Investigating the Impact of Code Comment Inconsistency on Bug Introducing (arXiv:2409.10781, 2024-09-16)
Shiva Radmanesh, Aaron Imani, Iftekhar Ahmed, Mohammad Moshirpour
Code comments are essential for clarifying code functionality, improving readability, and facilitating collaboration among developers. Despite their importance, comments often become outdated, leading to inconsistencies with the corresponding code. This can mislead developers and potentially introduce bugs. Our research investigates the impact of code-comment inconsistency on bug introduction using large language models, specifically GPT-3.5. We first compare the performance of GPT-3.5 with other state-of-the-art methods in detecting these inconsistencies, demonstrating the superiority of GPT-3.5 in this domain. Additionally, we analyze the temporal evolution of code-comment inconsistencies and their effect on bug proneness over various timeframes using GPT-3.5 and odds-ratio analysis. Our findings reveal that inconsistent changes are around 1.5 times more likely to lead to a bug-introducing commit than consistent changes, highlighting the necessity of maintaining consistent and up-to-date comments in software development. This study provides new insights into the relationship between code-comment inconsistency and software quality, offering a comprehensive analysis of its impact over time. In particular, the impact of code-comment inconsistency on bug introduction is highest immediately after the inconsistency is introduced and diminishes over time.
{"title":"Investigating the Impact of Code Comment Inconsistency on Bug Introducing","authors":"Shiva Radmanesh, Aaron Imani, Iftekhar Ahmed, Mohammad Moshirpour","doi":"arxiv-2409.10781","DOIUrl":"https://doi.org/arxiv-2409.10781","url":null,"abstract":"Code comments are essential for clarifying code functionality, improving\u0000readability, and facilitating collaboration among developers. Despite their\u0000importance, comments often become outdated, leading to inconsistencies with the\u0000corresponding code. This can mislead developers and potentially introduce bugs.\u0000Our research investigates the impact of code-comment inconsistency on bug\u0000introduction using large language models, specifically GPT-3.5. We first\u0000compare the performance of the GPT-3.5 model with other state-of-the-art\u0000methods in detecting these inconsistencies, demonstrating the superiority of\u0000GPT-3.5 in this domain. Additionally, we analyze the temporal evolution of\u0000code-comment inconsistencies and their effect on bug proneness over various\u0000timeframes using GPT-3.5 and Odds ratio analysis. Our findings reveal that\u0000inconsistent changes are around 1.5 times more likely to lead to a\u0000bug-introducing commit than consistent changes, highlighting the necessity of\u0000maintaining consistent and up-to-date comments in software development. This\u0000study provides new insights into the relationship between code-comment\u0000inconsistency and software quality, offering a comprehensive analysis of its\u0000impact over time, demonstrating that the impact of code-comment inconsistency\u0000on bug introduction is highest immediately after the inconsistency is\u0000introduced and diminishes over time.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"19 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261271","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Large-Scale Privacy Assessment of Android Third-Party SDKs (arXiv:2409.10411, 2024-09-16)
Mark Huasong Meng, Chuan Yan, Yun Hao, Qing Zhang, Zeyu Wang, Kailong Wang, Sin Gee Teo, Guangdong Bai, Jin Song Dong
Third-party Software Development Kits (SDKs) are widely adopted in Android app development to accelerate development pipelines and enhance app functionality. However, this convenience raises substantial concerns about unauthorized access to users' privacy-sensitive information, which could be further abused for illegitimate purposes such as user tracking or monetization. Our study offers a targeted analysis of user privacy protection among Android third-party SDKs, filling a critical gap in the Android software supply chain. It focuses on two aspects of their privacy practices: data exfiltration and behavior-policy compliance (i.e., privacy compliance), using taint analysis and large language models. It covers 158 widely used SDKs from two key SDK release platforms, the official one and a large alternative one. From them, we identified 338 instances of privacy data exfiltration. Regarding privacy compliance, our study reveals that more than 30% of the examined SDKs fail to provide a privacy policy disclosing their data handling practices. Among those that provide privacy policies, 37% over-collect user data and 88% falsely claim access to sensitive data. We revisited the latest versions of the SDKs after 12 months; our analysis demonstrates a persistent lack of improvement in these concerning trends. Based on our findings, we propose three actionable recommendations to mitigate privacy leakage risks and enhance privacy protection for Android users. Our research not only serves as an urgent call for industry attention but also provides crucial insights for future regulatory interventions.
{"title":"A Large-Scale Privacy Assessment of Android Third-Party SDKs","authors":"Mark Huasong Meng, Chuan Yan, Yun Hao, Qing Zhang, Zeyu Wang, Kailong Wang, Sin Gee Teo, Guangdong Bai, Jin Song Dong","doi":"arxiv-2409.10411","DOIUrl":"https://doi.org/arxiv-2409.10411","url":null,"abstract":"Third-party Software Development Kits (SDKs) are widely adopted in Android\u0000app development, to effortlessly accelerate development pipelines and enhance\u0000app functionality. However, this convenience raises substantial concerns about\u0000unauthorized access to users' privacy-sensitive information, which could be\u0000further abused for illegitimate purposes like user tracking or monetization.\u0000Our study offers a targeted analysis of user privacy protection among Android\u0000third-party SDKs, filling a critical gap in the Android software supply chain.\u0000It focuses on two aspects of their privacy practices, including data\u0000exfiltration and behavior-policy compliance (or privacy compliance), utilizing\u0000techniques of taint analysis and large language models. It covers 158\u0000widely-used SDKs from two key SDK release platforms, the official one and a\u0000large alternative one. From them, we identified 338 instances of privacy data\u0000exfiltration. On the privacy compliance, our study reveals that more than 30%\u0000of the examined SDKs fail to provide a privacy policy to disclose their data\u0000handling practices. Among those that provide privacy policies, 37% of them\u0000over-collect user data, and 88% falsely claim access to sensitive data. We\u0000revisit the latest versions of the SDKs after 12 months. Our analysis\u0000demonstrates a persistent lack of improvement in these concerning trends. Based\u0000on our findings, we propose three actionable recommendations to mitigate the\u0000privacy leakage risks and enhance privacy protection for Android users. Our\u0000research not only serves as an urgent call for industry attention but also\u0000provides crucial insights for future regulatory interventions.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"20 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code (arXiv:2409.10280, 2024-09-16)
Jia Feng, Jiachen Liu, Cuiyun Gao, Chun Yong Chong, Chaozheng Wang, Shan Gao, Xin Xia
In recent years, the application of large language models (LLMs) to code-related tasks has gained significant attention. However, existing evaluation benchmarks often focus on limited scenarios, such as code generation or completion, which do not reflect the diverse challenges developers face in real-world contexts. To address this, we introduce ComplexCodeEval, a benchmark designed to assess large code models (LCMs) on various development tasks, including code generation, completion, API recommendation, and test case generation. It includes 3,897 Java samples and 7,184 Python samples from high-star GitHub repositories, each annotated with function signatures, docstrings, and API references to simulate real development environments. Our experiments across ten LCMs reveal that context improves performance and that data leakage can lead to overestimation, highlighting the need for more accurate evaluations.
{"title":"ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code","authors":"Jia Feng, Jiachen Liu, Cuiyun Gao, Chun Yong Chong, Chaozheng Wang, Shan Gao, Xin Xia","doi":"arxiv-2409.10280","DOIUrl":"https://doi.org/arxiv-2409.10280","url":null,"abstract":"In recent years, the application of large language models (LLMs) to\u0000code-related tasks has gained significant attention. However, existing\u0000evaluation benchmarks often focus on limited scenarios, such as code generation\u0000or completion, which do not reflect the diverse challenges developers face in\u0000real-world contexts. To address this, we introduce ComplexCodeEval, a benchmark\u0000designed to assess LCMs in various development tasks, including code\u0000generation, completion, API recommendation, and test case generation. It\u0000includes 3,897 Java samples and 7,184 Python samples from high-star GitHub\u0000repositories, each annotated with function signatures, docstrings, and API\u0000references to simulate real development environments. Our experiments across\u0000ten LCMs reveal that context improves performance and that data leakage can\u0000lead to overestimation, highlighting the need for more accurate evaluations.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"41 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261342","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Centralization potential of automotive E/E architectures (arXiv:2409.10690, 2024-09-16)
Lucas Mauser, Stefan Wagner
Current automotive E/E architectures are subject to significant transformations: computing-power-intensive advanced driver-assistance systems, bandwidth-hungry infotainment systems, the connection of the vehicle to the internet, and the consequential need for cyber-security drive the centralization of E/E architectures. A centralized architecture is often seen as a key enabler to master these challenges. Available research focuses mostly on the different types of E/E architectures and contrasts their advantages and disadvantages. There is a research gap on guidelines for system designers and function developers to analyze the centralization potential of their systems. The present paper aims to quantify centralization potential by reviewing relevant literature and conducting qualitative interviews with industry practitioners. In the literature, we identified seven key automotive system properties that reach limitations in current automotive architectures: busload, functional safety, computing power, feature dependencies, development and maintenance costs, error rate, and modularity and flexibility. These properties serve as quantitative evaluation criteria to estimate whether centralization would enhance overall system performance. In the interviews, we validated centralization and its foundation, conceptual systems engineering, as capabilities to mitigate these limitations. By focusing on practical insights and lessons learned, this research provides system designers with actionable guidance to optimize their systems, addressing the outlined challenges while avoiding a monolithic architecture. This paper bridges the gap between theoretical research and practical application, offering valuable takeaways for practitioners.
{"title":"Centralization potential of automotive E/E architectures","authors":"Lucas Mauser, Stefan Wagner","doi":"arxiv-2409.10690","DOIUrl":"https://doi.org/arxiv-2409.10690","url":null,"abstract":"Current automotive E/E architectures are subject to significant\u0000transformations: Computing-power-intensive advanced driver-assistance systems,\u0000bandwidth-hungry infotainment systems, the connection of the vehicle with the\u0000internet and the consequential need for cyber-security drives the\u0000centralization of E/E architectures. A centralized architecture is often seen\u0000as a key enabler to master those challenges. Available research focuses mostly\u0000on the different types of E/E architectures and contrasts their advantages and\u0000disadvantages. There is a research gap on guidelines for system designers and\u0000function developers to analyze the potential of their systems for\u0000centralization. The present paper aims to quantify centralization potential\u0000reviewing relevant literature and conducting qualitative interviews with\u0000industry practitioners. In literature, we identified seven key automotive\u0000system properties reaching limitations in current automotive architectures:\u0000busload, functional safety, computing power, feature dependencies, development\u0000and maintenance costs, error rate, modularity and flexibility. These properties\u0000serve as quantitative evaluation criteria to estimate whether centralization\u0000would enhance overall system performance. In the interviews, we have validated\u0000centralization and its fundament - the conceptual systems engineering - as\u0000capabilities to mitigate these limitations. By focusing on practical insights\u0000and lessons learned, this research provides system designers with actionable\u0000guidance to optimize their systems, addressing the outlined challenges while\u0000avoiding monolithic architecture. This paper bridges the gap between\u0000theoretical research and practical application, offering valuable takeaways for\u0000practitioners.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"19 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261332","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards Semantic Versioning of Open Pre-trained Language Model Releases on Hugging Face (arXiv:2409.10472, 2024-09-16)
Adekunle Ajibode, Abdul Ali Bangash, Filipe Roseiro Cogo, Bram Adams, Ahmed E. Hassan
The proliferation of open Pre-trained Language Models (PTLMs) on model registry platforms like Hugging Face (HF) presents both opportunities and challenges for companies building products around them. Similar to traditional software dependencies, PTLMs continue to evolve after a release. However, the current state of PTLM release practices on model registry platforms is plagued by a variety of inconsistencies, such as ambiguous naming conventions and inaccessible model training documentation. Given the knowledge gap on current PTLM release practices, our empirical study uses a mixed-methods approach to analyze the releases of 52,227 PTLMs on the most well-known model registry, HF. Our results reveal 148 different naming practices for PTLM releases, with 40.87% of changes to model weight files not represented in the adopted name-based versioning practice or their documentation. In addition, we identified that the 52,227 PTLMs are derived from only 299 different base models (the original models that were modified to create them), with fine-tuning and quantization being the most prevalent modification methods applied to these base models. Significant gaps in release transparency, in terms of training dataset specifications and model card availability, still exist, highlighting the need for standardized documentation. While we identified a model naming practice that explicitly differentiates between major and minor PTLM releases, we did not find any significant difference in the types of changes that went into either type of release, suggesting that major/minor version numbers for PTLMs are often chosen arbitrarily. Our findings provide valuable insights to improve PTLM release practices, nudging the field towards more formal semantic versioning practices.
{"title":"Towards Semantic Versioning of Open Pre-trained Language Model Releases on Hugging Face","authors":"Adekunle Ajibode, Abdul Ali Bangash, Filipe Roseiro Cogo, Bram Adams, Ahmed E. Hassan","doi":"arxiv-2409.10472","DOIUrl":"https://doi.org/arxiv-2409.10472","url":null,"abstract":"The proliferation of open Pre-trained Language Models (PTLMs) on model\u0000registry platforms like Hugging Face (HF) presents both opportunities and\u0000challenges for companies building products around them. Similar to traditional\u0000software dependencies, PTLMs continue to evolve after a release. However, the\u0000current state of release practices of PTLMs on model registry platforms are\u0000plagued by a variety of inconsistencies, such as ambiguous naming conventions\u0000and inaccessible model training documentation. Given the knowledge gap on\u0000current PTLM release practices, our empirical study uses a mixed-methods\u0000approach to analyze the releases of 52,227 PTLMs on the most well-known model\u0000registry, HF. Our results reveal 148 different naming practices for PTLM\u0000releases, with 40.87% of changes to model weight files not represented in the\u0000adopted name-based versioning practice or their documentation. In addition, we\u0000identified that the 52,227 PTLMs are derived from only 299 different base\u0000models (the modified original models used to create 52,227 PTLMs), with\u0000Fine-tuning and Quantization being the most prevalent modification methods\u0000applied to these base models. Significant gaps in release transparency, in\u0000terms of training dataset specifications and model card availability, still\u0000exist, highlighting the need for standardized documentation. While we\u0000identified a model naming practice explicitly differentiating between major and\u0000minor PTLM releases, we did not find any significant difference in the types of\u0000changes that went into either type of releases, suggesting that major/minor\u0000version numbers for PTLMs often are chosen arbitrarily. Our findings provide\u0000valuable insights to improve PTLM release practices, nudging the field towards\u0000more formal semantic versioning practices.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261341","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Understanding Code Change with Micro-Changes (arXiv:2409.09923, 2024-09-16)
Lei Chen, Michele Lanza, Shinpei Hayashi
A crucial activity in software maintenance and evolution is the comprehension of the changes performed by developers when they submit a pull request and/or perform a commit on the repository. Typically, code changes are represented in the form of code diffs: textual representations highlighting the differences between two file versions, depicting the added, removed, and changed lines. This simplistic representation must be interpreted by developers and mentally lifted to a higher abstraction level that more closely resembles natural language descriptions and eases the creation of a mental model of the changes. However, the textual diff-based representation is cumbersome, and the lifting requires considerable domain knowledge and programming skills. We present an approach, based on the concept of micro-change, to overcome these difficulties, translating code diffs into a series of pre-defined change operations that can be described in natural language. We present a catalog of micro-changes, together with an automated micro-change detector. To evaluate our approach, we performed an empirical study on a large set of open-source repositories, focusing on a subset of our micro-change catalog, namely those related to changes affecting conditional logic. We found that our detector is capable of explaining more than 67% of the changes taking place in the systems under study.
{"title":"Understanding Code Change with Micro-Changes","authors":"Lei Chen, Michele Lanza, Shinpei Hayashi","doi":"arxiv-2409.09923","DOIUrl":"https://doi.org/arxiv-2409.09923","url":null,"abstract":"A crucial activity in software maintenance and evolution is the comprehension\u0000of the changes performed by developers, when they submit a pull request and/or\u0000perform a commit on the repository. Typically, code changes are represented in\u0000the form of code diffs, textual representations highlighting the differences\u0000between two file versions, depicting the added, removed, and changed lines.\u0000This simplistic representation must be interpreted by developers, and mentally\u0000lifted to a higher abstraction level, that more closely resembles natural\u0000language descriptions, and eases the creation of a mental model of the changes.\u0000However, the textual diff-based representation is cumbersome, and the lifting\u0000requires considerable domain knowledge and programming skills. We present an\u0000approach, based on the concept of micro-change, to overcome these difficulties,\u0000translating code diffs into a series of pre-defined change operations, which\u0000can be described in natural language. We present a catalog of micro-changes,\u0000together with an automated micro-change detector. To evaluate our approach, we\u0000performed an empirical study on a large set of open-source repositories,\u0000focusing on a subset of our micro-change catalog, namely those related to\u0000changes affecting the conditional logic. We found that our detector is capable\u0000of explaining more than 67% of the changes taking place in the systems under\u0000study.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261392","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
eWAPA: An eBPF-based WASI Performance Analysis Framework for WebAssembly Runtimes (arXiv:2409.10252, 2024-09-16)
Chenxi Mao, Yuxin Su, Shiwen Shan, Dan Li
WebAssembly (Wasm) is a low-level bytecode format that can run in modern browsers. With the development of standalone runtimes and the improvement of the WebAssembly System Interface (WASI), Wasm now also provides a more complete sandboxed runtime experience for server-side applications, effectively expanding its application scenarios. However, the implementation of WASI varies across runtimes, and suboptimal interface implementations can lead to performance degradation during interactions between the runtime and the operating system. Existing research mainly focuses on the overall performance evaluation of runtimes, while studies of WASI implementations are relatively scarce. To tackle this problem, we propose an eBPF-based WASI performance analysis framework. It collects key performance metrics of the runtime under different I/O load conditions, such as total execution time, startup time, WASI execution time, and syscall time, allowing us to comprehensively analyze the performance of the runtime's I/O interactions with the operating system. Additionally, we provide a detailed analysis of the causes behind two specific WASI performance anomalies. These analytical results will guide the optimization of standalone runtimes and WASI implementations, enhancing their efficiency.
{"title":"eWAPA: An eBPF-based WASI Performance Analysis Framework for WebAssembly Runtimes","authors":"Chenxi Mao, Yuxin Su, Shiwen Shan, Dan Li","doi":"arxiv-2409.10252","DOIUrl":"https://doi.org/arxiv-2409.10252","url":null,"abstract":"WebAssembly (Wasm) is a low-level bytecode format that can run in modern\u0000browsers. With the development of standalone runtimes and the improvement of\u0000the WebAssembly System Interface (WASI), Wasm has further provided a more\u0000complete sandboxed runtime experience for server-side applications, effectively\u0000expanding its application scenarios. However, the implementation of WASI varies\u0000across different runtimes, and suboptimal interface implementations can lead to\u0000performance degradation during interactions between the runtime and the\u0000operating system. Existing research mainly focuses on overall performance\u0000evaluation of runtimes, while studies on WASI implementations are relatively\u0000scarce. To tackle this problem, we propose an eBPF-based WASI performance\u0000analysis framework. It collects key performance metrics of the runtime under\u0000different I/O load conditions, such as total execution time, startup time, WASI\u0000execution time, and syscall time. We can comprehensively analyze the\u0000performance of the runtime's I/O interactions with the operating system.\u0000Additionally, we provide a detailed analysis of the causes behind two specific\u0000WASI performance anomalies. These analytical results will guide the\u0000optimization of standalone runtimes and WASI implementations, enhancing their\u0000efficiency.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"24 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261343","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}