Can GPT-O1 Kill All Bugs? (arXiv:2409.10033, 2024-09-16)
Haichuan Hu, Ye Shang, Guolin Xu, Congqing He, Quanjun Zhang
ChatGPT has long been shown to be effective for automatic program repair (APR). With continuous iterations and upgrades, its repair performance has reached state-of-the-art levels. However, few works compare the effectiveness of different ChatGPT versions on APR. In this work, we evaluate the latest O1 models (O1-preview and O1-mini), ChatGPT-4o, and historical versions of ChatGPT on APR. We study the improvements of the O1 models over traditional ChatGPT from multiple perspectives (repair success rate, repair cost, and behavior patterns), and find that O1's repair capability exceeds that of traditional ChatGPT, successfully fixing all 40 bugs in the benchmark. Our work can serve as a reference for further in-depth exploration of ChatGPT's applications in APR.
{"title":"Can GPT-O1 Kill All Bugs?","authors":"Haichuan Hu, Ye Shang, Guolin Xu, Congqing He, Quanjun Zhang","doi":"arxiv-2409.10033","DOIUrl":"https://doi.org/arxiv-2409.10033","url":null,"abstract":"ChatGPT has long been proven to be effective in automatic program repair\u0000(APR). With the continuous iterations and upgrades of the ChatGPT version, its\u0000performance in terms of fixes has already reached state-of-the-art levels.\u0000However, there are few works comparing the effectiveness and variations of\u0000different versions of ChatGPT on APR. In this work, we evaluate the performance\u0000of the latest version of ChatGPT (O1-preview and O1-mini), ChatGPT-4o, and\u0000historical version of ChatGPT on APR. We study the improvements of the O1 model\u0000over traditional ChatGPT in terms of APR from multiple perspectives (repair\u0000success rate, repair cost, behavior patterns), and find that O1's repair\u0000capability exceeds that of traditional ChatGPT, successfully fixing all 40 bugs\u0000in the benchmark. Our work can serve as a reference for further in-depth\u0000exploration of the applications of ChatGPT in APR.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261346","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Do Test and Environmental Complexity Increase Flakiness? An Empirical Study of SAP HANA (arXiv:2409.10062, 2024-09-16)
Alexander Berndt, Thomas Bach, Sebastian Baltes
Background: Test flakiness is a major problem in the software industry. Flaky tests fail seemingly at random, without changes to the code, and thus impede continuous integration (CI). Some researchers argue that all tests can be considered flaky and that tests only differ in their frequency of flaky failures.
Aims: With the goal of developing mitigation strategies to reduce the negative impact of test flakiness, we study characteristics of tests and the test environment that potentially impact test flakiness.
Method: We construct two datasets based on SAP HANA's test results over a 12-week period: one based on production data, the other based on targeted test executions from a dedicated flakiness experiment. We conduct correlation analysis for test and test environment characteristics with respect to their influence on the frequency of flaky test failures.
Results: In our study, the average test execution time had the strongest positive correlation with the test flakiness rate (r = 0.79), which confirms previous studies. Potential reasons for higher flakiness include the larger test scope of long-running tests or test executions on a slower test infrastructure. Interestingly, the load on the testing infrastructure was not correlated with test flakiness. The relationship between test flakiness and the resources required for test execution is inconclusive.
Conclusions: Based on our findings, we conclude that splitting long-running tests can be an important measure for practitioners to cope with test flakiness, as it enables parallelization of test executions and also reduces the cost of re-executions. This effectively decreases the negative effects of test flakiness in complex testing environments. However, when splitting long-running tests, practitioners need to consider the potential test setup overhead of test splits.
{"title":"Do Test and Environmental Complexity Increase Flakiness? An Empirical Study of SAP HANA","authors":"Alexander Berndt, Thomas Bach, Sebastian Baltes","doi":"arxiv-2409.10062","DOIUrl":"https://doi.org/arxiv-2409.10062","url":null,"abstract":"Background: Test flakiness is a major problem in the software industry. Flaky\u0000tests fail seemingly at random without changes to the code and thus impede\u0000continuous integration (CI). Some researchers argue that all tests can be\u0000considered flaky and that tests only differ in their frequency of flaky\u0000failures. Aims: With the goal of developing mitigation strategies to reduce the\u0000negative impact of test flakiness, we study characteristics of tests and the\u0000test environment that potentially impact test flakiness. Method: We construct two datasets based on SAP HANA's test results over a\u000012-week period: one based on production data, the other based on targeted test\u0000executions from a dedicated flakiness experiment. We conduct correlation\u0000analysis for test and test environment characteristics with respect to their\u0000influence on the frequency of flaky test failures. Results: In our study, the average test execution time had the strongest\u0000positive correlation with the test flakiness rate (r = 0.79), which confirms\u0000previous studies. Potential reasons for higher flakiness include the larger\u0000test scope of long-running tests or test executions on a slower test\u0000infrastructure. Interestingly, the load on the testing infrastructure was not\u0000correlated with test flakiness. The relationship between test flakiness and\u0000required resources for test execution is inconclusive. Conclusions: Based on our findings, we conclude that splitting long-running\u0000tests can be an important measure for practitioners to cope with test\u0000flakiness, as it enables parallelization of test executions and also reduces\u0000the cost of re-executions. This effectively decreases the negative effects of\u0000test flakiness in complex testing environments. However, when splitting\u0000long-running tests, practitioners need to consider the potential test setup\u0000overhead of test splits.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"47 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LeGEND: A Top-Down Approach to Scenario Generation of Autonomous Driving Systems Assisted by Large Language Models (arXiv:2409.10066, 2024-09-16)
Shuncheng Tang, Zhenya Zhang, Jixiang Zhou, Lei Lei, Yuan Zhou, Yinxing Xue
Autonomous driving systems (ADS) are safety-critical and require comprehensive testing before their deployment on public roads. While existing testing approaches primarily aim at the criticality of scenarios, they often overlook the diversity of the generated scenarios, which is also important for exposing system defects in different aspects. To bridge this gap, we propose LeGEND, which features a top-down approach to scenario generation: it starts with abstract functional scenarios and then steps down to logical and concrete scenarios, so that scenario diversity can be controlled at the functional level. However, unlike logical scenarios, which can be formally described, functional scenarios are often documented in natural language (e.g., accident reports) and thus cannot be precisely parsed and processed by computers. To tackle this issue, LeGEND leverages recent advances in large language models (LLMs) to transform textual functional scenarios into formal logical scenarios. To mitigate the distraction of useless information in functional scenario descriptions, we devise a two-phase transformation that uses an intermediate language; consequently, we adopt two LLMs in LeGEND, one for extracting information from functional scenarios and the other for converting the extracted information into formal logical scenarios. We experimentally evaluate LeGEND on Apollo, an industry-grade ADS from Baidu. Evaluation results show that LeGEND can effectively identify critical scenarios and, compared to baseline approaches, exhibits evident superiority in the diversity of generated scenarios. Moreover, we also demonstrate the advantages of our two-phase transformation framework and the accuracy of the adopted LLMs.
{"title":"LeGEND: A Top-Down Approach to Scenario Generation of Autonomous Driving Systems Assisted by Large Language Models","authors":"Shuncheng Tang, Zhenya Zhang, Jixiang Zhou, Lei Lei, Yuan Zhou, Yinxing Xue","doi":"arxiv-2409.10066","DOIUrl":"https://doi.org/arxiv-2409.10066","url":null,"abstract":"Autonomous driving systems (ADS) are safety-critical and require\u0000comprehensive testing before their deployment on public roads. While existing\u0000testing approaches primarily aim at the criticality of scenarios, they often\u0000overlook the diversity of the generated scenarios that is also important to\u0000reflect system defects in different aspects. To bridge the gap, we propose\u0000LeGEND, that features a top-down fashion of scenario generation: it starts with\u0000abstract functional scenarios, and then steps downwards to logical and concrete\u0000scenarios, such that scenario diversity can be controlled at the functional\u0000level. However, unlike logical scenarios that can be formally described,\u0000functional scenarios are often documented in natural languages (e.g., accident\u0000reports) and thus cannot be precisely parsed and processed by computers. To\u0000tackle that issue, LeGEND leverages the recent advances of large language\u0000models (LLMs) to transform textual functional scenarios to formal logical\u0000scenarios. To mitigate the distraction of useless information in functional\u0000scenario description, we devise a two-phase transformation that features the\u0000use of an intermediate language; consequently, we adopt two LLMs in LeGEND, one\u0000for extracting information from functional scenarios, the other for converting\u0000the extracted information to formal logical scenarios. We experimentally\u0000evaluate LeGEND on Apollo, an industry-grade ADS from Baidu. Evaluation results\u0000show that LeGEND can effectively identify critical scenarios, and compared to\u0000baseline approaches, LeGEND exhibits evident superiority in diversity of\u0000generated scenarios. Moreover, we also demonstrate the advantages of our\u0000two-phase transformation framework, and the accuracy of the adopted LLMs.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"93 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261344","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Investigating the Impact of Code Comment Inconsistency on Bug Introducing (arXiv:2409.10781, 2024-09-16)
Shiva Radmanesh, Aaron Imani, Iftekhar Ahmed, Mohammad Moshirpour
Code comments are essential for clarifying code functionality, improving readability, and facilitating collaboration among developers. Despite their importance, comments often become outdated, leading to inconsistencies with the corresponding code. This can mislead developers and potentially introduce bugs. Our research investigates the impact of code-comment inconsistency on bug introduction using large language models, specifically GPT-3.5. We first compare the performance of GPT-3.5 with other state-of-the-art methods in detecting these inconsistencies, demonstrating the superiority of GPT-3.5 in this domain. Additionally, we analyze the temporal evolution of code-comment inconsistencies and their effect on bug proneness over various timeframes using GPT-3.5 and odds-ratio analysis. Our findings reveal that inconsistent changes are around 1.5 times more likely to lead to a bug-introducing commit than consistent changes, highlighting the necessity of maintaining consistent and up-to-date comments in software development. This study provides new insights into the relationship between code-comment inconsistency and software quality, offering a comprehensive analysis of its impact over time. In particular, the impact of code-comment inconsistency on bug introduction is highest immediately after the inconsistency is introduced and diminishes over time.
{"title":"Investigating the Impact of Code Comment Inconsistency on Bug Introducing","authors":"Shiva Radmanesh, Aaron Imani, Iftekhar Ahmed, Mohammad Moshirpour","doi":"arxiv-2409.10781","DOIUrl":"https://doi.org/arxiv-2409.10781","url":null,"abstract":"Code comments are essential for clarifying code functionality, improving\u0000readability, and facilitating collaboration among developers. Despite their\u0000importance, comments often become outdated, leading to inconsistencies with the\u0000corresponding code. This can mislead developers and potentially introduce bugs.\u0000Our research investigates the impact of code-comment inconsistency on bug\u0000introduction using large language models, specifically GPT-3.5. We first\u0000compare the performance of the GPT-3.5 model with other state-of-the-art\u0000methods in detecting these inconsistencies, demonstrating the superiority of\u0000GPT-3.5 in this domain. Additionally, we analyze the temporal evolution of\u0000code-comment inconsistencies and their effect on bug proneness over various\u0000timeframes using GPT-3.5 and Odds ratio analysis. Our findings reveal that\u0000inconsistent changes are around 1.5 times more likely to lead to a\u0000bug-introducing commit than consistent changes, highlighting the necessity of\u0000maintaining consistent and up-to-date comments in software development. This\u0000study provides new insights into the relationship between code-comment\u0000inconsistency and software quality, offering a comprehensive analysis of its\u0000impact over time, demonstrating that the impact of code-comment inconsistency\u0000on bug introduction is highest immediately after the inconsistency is\u0000introduced and diminishes over time.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"19 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261271","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Large-Scale Privacy Assessment of Android Third-Party SDKs (arXiv:2409.10411, 2024-09-16)
Mark Huasong Meng, Chuan Yan, Yun Hao, Qing Zhang, Zeyu Wang, Kailong Wang, Sin Gee Teo, Guangdong Bai, Jin Song Dong
Third-party Software Development Kits (SDKs) are widely adopted in Android app development to accelerate development pipelines and enhance app functionality. However, this convenience raises substantial concerns about unauthorized access to users' privacy-sensitive information, which could be further abused for illegitimate purposes such as user tracking or monetization. Our study offers a targeted analysis of user privacy protection among Android third-party SDKs, filling a critical gap in the Android software supply chain. It focuses on two aspects of their privacy practices: data exfiltration and behavior-policy compliance (i.e., privacy compliance), using taint analysis and large language models. It covers 158 widely used SDKs from two key SDK release platforms, the official one and a large alternative one. From them, we identified 338 instances of privacy data exfiltration. Regarding privacy compliance, our study reveals that more than 30% of the examined SDKs fail to provide a privacy policy disclosing their data handling practices. Among those that provide privacy policies, 37% over-collect user data and 88% falsely claim access to sensitive data. We revisited the latest versions of the SDKs after 12 months; our analysis demonstrates a persistent lack of improvement in these concerning trends. Based on our findings, we propose three actionable recommendations to mitigate privacy leakage risks and enhance privacy protection for Android users. Our research not only serves as an urgent call for industry attention but also provides crucial insights for future regulatory interventions.
{"title":"A Large-Scale Privacy Assessment of Android Third-Party SDKs","authors":"Mark Huasong Meng, Chuan Yan, Yun Hao, Qing Zhang, Zeyu Wang, Kailong Wang, Sin Gee Teo, Guangdong Bai, Jin Song Dong","doi":"arxiv-2409.10411","DOIUrl":"https://doi.org/arxiv-2409.10411","url":null,"abstract":"Third-party Software Development Kits (SDKs) are widely adopted in Android\u0000app development, to effortlessly accelerate development pipelines and enhance\u0000app functionality. However, this convenience raises substantial concerns about\u0000unauthorized access to users' privacy-sensitive information, which could be\u0000further abused for illegitimate purposes like user tracking or monetization.\u0000Our study offers a targeted analysis of user privacy protection among Android\u0000third-party SDKs, filling a critical gap in the Android software supply chain.\u0000It focuses on two aspects of their privacy practices, including data\u0000exfiltration and behavior-policy compliance (or privacy compliance), utilizing\u0000techniques of taint analysis and large language models. It covers 158\u0000widely-used SDKs from two key SDK release platforms, the official one and a\u0000large alternative one. From them, we identified 338 instances of privacy data\u0000exfiltration. On the privacy compliance, our study reveals that more than 30%\u0000of the examined SDKs fail to provide a privacy policy to disclose their data\u0000handling practices. Among those that provide privacy policies, 37% of them\u0000over-collect user data, and 88% falsely claim access to sensitive data. We\u0000revisit the latest versions of the SDKs after 12 months. Our analysis\u0000demonstrates a persistent lack of improvement in these concerning trends. Based\u0000on our findings, we propose three actionable recommendations to mitigate the\u0000privacy leakage risks and enhance privacy protection for Android users. Our\u0000research not only serves as an urgent call for industry attention but also\u0000provides crucial insights for future regulatory interventions.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"20 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code (arXiv:2409.10280, 2024-09-16)
Jia Feng, Jiachen Liu, Cuiyun Gao, Chun Yong Chong, Chaozheng Wang, Shan Gao, Xin Xia
In recent years, the application of large language models (LLMs) to code-related tasks has gained significant attention. However, existing evaluation benchmarks often focus on limited scenarios, such as code generation or completion, which do not reflect the diverse challenges developers face in real-world contexts. To address this, we introduce ComplexCodeEval, a benchmark designed to assess large code models (LCMs) on various development tasks, including code generation, completion, API recommendation, and test case generation. It includes 3,897 Java samples and 7,184 Python samples from high-star GitHub repositories, each annotated with function signatures, docstrings, and API references to simulate real development environments. Our experiments across ten LCMs reveal that context improves performance and that data leakage can lead to overestimation, highlighting the need for more accurate evaluations.
{"title":"ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code","authors":"Jia Feng, Jiachen Liu, Cuiyun Gao, Chun Yong Chong, Chaozheng Wang, Shan Gao, Xin Xia","doi":"arxiv-2409.10280","DOIUrl":"https://doi.org/arxiv-2409.10280","url":null,"abstract":"In recent years, the application of large language models (LLMs) to\u0000code-related tasks has gained significant attention. However, existing\u0000evaluation benchmarks often focus on limited scenarios, such as code generation\u0000or completion, which do not reflect the diverse challenges developers face in\u0000real-world contexts. To address this, we introduce ComplexCodeEval, a benchmark\u0000designed to assess LCMs in various development tasks, including code\u0000generation, completion, API recommendation, and test case generation. It\u0000includes 3,897 Java samples and 7,184 Python samples from high-star GitHub\u0000repositories, each annotated with function signatures, docstrings, and API\u0000references to simulate real development environments. Our experiments across\u0000ten LCMs reveal that context improves performance and that data leakage can\u0000lead to overestimation, highlighting the need for more accurate evaluations.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"41 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261342","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Centralization potential of automotive E/E architectures (arXiv:2409.10690, 2024-09-16)
Lucas Mauser, Stefan Wagner
Current automotive E/E architectures are subject to significant transformations: computing-power-intensive advanced driver-assistance systems, bandwidth-hungry infotainment systems, the connection of the vehicle to the internet, and the consequential need for cyber-security drive the centralization of E/E architectures. A centralized architecture is often seen as a key enabler to master these challenges. Available research focuses mostly on the different types of E/E architectures and contrasts their advantages and disadvantages. There is a research gap on guidelines for system designers and function developers to analyze the centralization potential of their systems. The present paper aims to quantify centralization potential by reviewing relevant literature and conducting qualitative interviews with industry practitioners. In the literature, we identified seven key automotive system properties that reach limitations in current automotive architectures: busload, functional safety, computing power, feature dependencies, development and maintenance costs, error rate, and modularity and flexibility. These properties serve as quantitative evaluation criteria to estimate whether centralization would enhance overall system performance. In the interviews, we validated centralization and its foundation, conceptual systems engineering, as capabilities to mitigate these limitations. By focusing on practical insights and lessons learned, this research provides system designers with actionable guidance to optimize their systems, addressing the outlined challenges while avoiding a monolithic architecture. This paper bridges the gap between theoretical research and practical application, offering valuable takeaways for practitioners.
{"title":"Centralization potential of automotive E/E architectures","authors":"Lucas Mauser, Stefan Wagner","doi":"arxiv-2409.10690","DOIUrl":"https://doi.org/arxiv-2409.10690","url":null,"abstract":"Current automotive E/E architectures are subject to significant\u0000transformations: Computing-power-intensive advanced driver-assistance systems,\u0000bandwidth-hungry infotainment systems, the connection of the vehicle with the\u0000internet and the consequential need for cyber-security drives the\u0000centralization of E/E architectures. A centralized architecture is often seen\u0000as a key enabler to master those challenges. Available research focuses mostly\u0000on the different types of E/E architectures and contrasts their advantages and\u0000disadvantages. There is a research gap on guidelines for system designers and\u0000function developers to analyze the potential of their systems for\u0000centralization. The present paper aims to quantify centralization potential\u0000reviewing relevant literature and conducting qualitative interviews with\u0000industry practitioners. In literature, we identified seven key automotive\u0000system properties reaching limitations in current automotive architectures:\u0000busload, functional safety, computing power, feature dependencies, development\u0000and maintenance costs, error rate, modularity and flexibility. These properties\u0000serve as quantitative evaluation criteria to estimate whether centralization\u0000would enhance overall system performance. In the interviews, we have validated\u0000centralization and its fundament - the conceptual systems engineering - as\u0000capabilities to mitigate these limitations. By focusing on practical insights\u0000and lessons learned, this research provides system designers with actionable\u0000guidance to optimize their systems, addressing the outlined challenges while\u0000avoiding monolithic architecture. This paper bridges the gap between\u0000theoretical research and practical application, offering valuable takeaways for\u0000practitioners.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"19 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261332","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards Semantic Versioning of Open Pre-trained Language Model Releases on Hugging Face (arXiv:2409.10472, 2024-09-16)
Adekunle Ajibode, Abdul Ali Bangash, Filipe Roseiro Cogo, Bram Adams, Ahmed E. Hassan
The proliferation of open Pre-trained Language Models (PTLMs) on model registry platforms like Hugging Face (HF) presents both opportunities and challenges for companies building products around them. Similar to traditional software dependencies, PTLMs continue to evolve after a release. However, the current state of PTLM release practices on model registry platforms is plagued by a variety of inconsistencies, such as ambiguous naming conventions and inaccessible model training documentation. Given the knowledge gap on current PTLM release practices, our empirical study uses a mixed-methods approach to analyze the releases of 52,227 PTLMs on the most well-known model registry, HF. Our results reveal 148 different naming practices for PTLM releases, with 40.87% of changes to model weight files not represented in the adopted name-based versioning practice or their documentation. In addition, we identified that the 52,227 PTLMs are derived from only 299 different base models (the original models that were modified to create them), with fine-tuning and quantization being the most prevalent modification methods applied to these base models. Significant gaps in release transparency, in terms of training dataset specifications and model card availability, still exist, highlighting the need for standardized documentation. While we identified a model naming practice that explicitly differentiates between major and minor PTLM releases, we did not find any significant difference in the types of changes that went into either type of release, suggesting that major/minor version numbers for PTLMs are often chosen arbitrarily. Our findings provide valuable insights to improve PTLM release practices, nudging the field towards more formal semantic versioning practices.
{"title":"Towards Semantic Versioning of Open Pre-trained Language Model Releases on Hugging Face","authors":"Adekunle Ajibode, Abdul Ali Bangash, Filipe Roseiro Cogo, Bram Adams, Ahmed E. Hassan","doi":"arxiv-2409.10472","DOIUrl":"https://doi.org/arxiv-2409.10472","url":null,"abstract":"The proliferation of open Pre-trained Language Models (PTLMs) on model\u0000registry platforms like Hugging Face (HF) presents both opportunities and\u0000challenges for companies building products around them. Similar to traditional\u0000software dependencies, PTLMs continue to evolve after a release. However, the\u0000current state of release practices of PTLMs on model registry platforms are\u0000plagued by a variety of inconsistencies, such as ambiguous naming conventions\u0000and inaccessible model training documentation. Given the knowledge gap on\u0000current PTLM release practices, our empirical study uses a mixed-methods\u0000approach to analyze the releases of 52,227 PTLMs on the most well-known model\u0000registry, HF. Our results reveal 148 different naming practices for PTLM\u0000releases, with 40.87% of changes to model weight files not represented in the\u0000adopted name-based versioning practice or their documentation. In addition, we\u0000identified that the 52,227 PTLMs are derived from only 299 different base\u0000models (the modified original models used to create 52,227 PTLMs), with\u0000Fine-tuning and Quantization being the most prevalent modification methods\u0000applied to these base models. Significant gaps in release transparency, in\u0000terms of training dataset specifications and model card availability, still\u0000exist, highlighting the need for standardized documentation. While we\u0000identified a model naming practice explicitly differentiating between major and\u0000minor PTLM releases, we did not find any significant difference in the types of\u0000changes that went into either type of releases, suggesting that major/minor\u0000version numbers for PTLMs often are chosen arbitrarily. Our findings provide\u0000valuable insights to improve PTLM release practices, nudging the field towards\u0000more formal semantic versioning practices.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261341","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Understanding Code Change with Micro-Changes (arXiv:2409.09923, 2024-09-16)
Lei Chen, Michele Lanza, Shinpei Hayashi
A crucial activity in software maintenance and evolution is the comprehension of the changes performed by developers when they submit a pull request and/or perform a commit on the repository. Typically, code changes are represented in the form of code diffs: textual representations highlighting the differences between two file versions, depicting the added, removed, and changed lines. This simplistic representation must be interpreted by developers and mentally lifted to a higher abstraction level that more closely resembles natural language descriptions and eases the creation of a mental model of the changes. However, the textual diff-based representation is cumbersome, and the lifting requires considerable domain knowledge and programming skills. We present an approach, based on the concept of micro-change, to overcome these difficulties, translating code diffs into a series of pre-defined change operations that can be described in natural language. We present a catalog of micro-changes, together with an automated micro-change detector. To evaluate our approach, we performed an empirical study on a large set of open-source repositories, focusing on a subset of our micro-change catalog, namely those related to changes affecting conditional logic. We found that our detector is capable of explaining more than 67% of the changes taking place in the systems under study.
{"title":"Understanding Code Change with Micro-Changes","authors":"Lei Chen, Michele Lanza, Shinpei Hayashi","doi":"arxiv-2409.09923","DOIUrl":"https://doi.org/arxiv-2409.09923","url":null,"abstract":"A crucial activity in software maintenance and evolution is the comprehension\u0000of the changes performed by developers, when they submit a pull request and/or\u0000perform a commit on the repository. Typically, code changes are represented in\u0000the form of code diffs, textual representations highlighting the differences\u0000between two file versions, depicting the added, removed, and changed lines.\u0000This simplistic representation must be interpreted by developers, and mentally\u0000lifted to a higher abstraction level, that more closely resembles natural\u0000language descriptions, and eases the creation of a mental model of the changes.\u0000However, the textual diff-based representation is cumbersome, and the lifting\u0000requires considerable domain knowledge and programming skills. We present an\u0000approach, based on the concept of micro-change, to overcome these difficulties,\u0000translating code diffs into a series of pre-defined change operations, which\u0000can be described in natural language. We present a catalog of micro-changes,\u0000together with an automated micro-change detector. To evaluate our approach, we\u0000performed an empirical study on a large set of open-source repositories,\u0000focusing on a subset of our micro-change catalog, namely those related to\u0000changes affecting the conditional logic. We found that our detector is capable\u0000of explaining more than 67% of the changes taking place in the systems under\u0000study.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261392","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
eWAPA: An eBPF-based WASI Performance Analysis Framework for WebAssembly Runtimes (arXiv:2409.10252, 2024-09-16)
Chenxi Mao, Yuxin Su, Shiwen Shan, Dan Li
WebAssembly (Wasm) is a low-level bytecode format that can run in modern browsers. With the development of standalone runtimes and the improvement of the WebAssembly System Interface (WASI), Wasm now also provides a more complete sandboxed runtime experience for server-side applications, effectively expanding its application scenarios. However, the implementation of WASI varies across runtimes, and suboptimal interface implementations can lead to performance degradation during interactions between the runtime and the operating system. Existing research mainly focuses on the overall performance evaluation of runtimes, while studies of WASI implementations are relatively scarce. To tackle this problem, we propose an eBPF-based WASI performance analysis framework. It collects key performance metrics of the runtime under different I/O load conditions, such as total execution time, startup time, WASI execution time, and syscall time, allowing us to comprehensively analyze the performance of the runtime's I/O interactions with the operating system. Additionally, we provide a detailed analysis of the causes behind two specific WASI performance anomalies. These analytical results will guide the optimization of standalone runtimes and WASI implementations, enhancing their efficiency.
{"title":"eWAPA: An eBPF-based WASI Performance Analysis Framework for WebAssembly Runtimes","authors":"Chenxi Mao, Yuxin Su, Shiwen Shan, Dan Li","doi":"arxiv-2409.10252","DOIUrl":"https://doi.org/arxiv-2409.10252","url":null,"abstract":"WebAssembly (Wasm) is a low-level bytecode format that can run in modern\u0000browsers. With the development of standalone runtimes and the improvement of\u0000the WebAssembly System Interface (WASI), Wasm has further provided a more\u0000complete sandboxed runtime experience for server-side applications, effectively\u0000expanding its application scenarios. However, the implementation of WASI varies\u0000across different runtimes, and suboptimal interface implementations can lead to\u0000performance degradation during interactions between the runtime and the\u0000operating system. Existing research mainly focuses on overall performance\u0000evaluation of runtimes, while studies on WASI implementations are relatively\u0000scarce. To tackle this problem, we propose an eBPF-based WASI performance\u0000analysis framework. It collects key performance metrics of the runtime under\u0000different I/O load conditions, such as total execution time, startup time, WASI\u0000execution time, and syscall time. We can comprehensively analyze the\u0000performance of the runtime's I/O interactions with the operating system.\u0000Additionally, we provide a detailed analysis of the causes behind two specific\u0000WASI performance anomalies. These analytical results will guide the\u0000optimization of standalone runtimes and WASI implementations, enhancing their\u0000efficiency.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"24 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261343","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}