
arXiv - CS - Software Engineering: Latest Publications

Insights from Benchmarking Frontier Language Models on Web App Code Generation
Pub Date : 2024-09-08 DOI: arxiv-2409.05177
Yi Cui
This paper presents insights from evaluating 16 frontier large language models (LLMs) on the WebApp1K benchmark, a test suite designed to assess the ability of LLMs to generate web application code. The results reveal that while all models possess similar underlying knowledge, their performance is differentiated by the frequency of mistakes they make. By analyzing lines of code (LOC) and failure distributions, we find that writing correct code is more complex than generating incorrect code. Furthermore, prompt engineering shows limited efficacy in reducing errors beyond specific cases. These findings suggest that further advancements in coding LLMs should emphasize model reliability and mistake minimization.
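The failure-frequency analysis described above can be illustrated with a short, self-contained sketch. The record fields (`model`, `task`, `passed`, `loc`, `error`) are hypothetical placeholders rather than the WebApp1K schema; the snippet only shows how per-model pass rates, mean LOC, and failure-type distributions might be tabulated from benchmark results.

```python
from collections import Counter

# Hypothetical benchmark records: one entry per (model, task) attempt.
results = [
    {"model": "model-a", "task": "t1", "passed": True,  "loc": 42, "error": None},
    {"model": "model-a", "task": "t2", "passed": False, "loc": 55, "error": "assertion"},
    {"model": "model-b", "task": "t1", "passed": False, "loc": 61, "error": "syntax"},
    {"model": "model-b", "task": "t2", "passed": True,  "loc": 40, "error": None},
]

def summarize(records):
    """Print per-model pass rate, mean LOC, and failure-type distribution."""
    by_model = {}
    for r in records:
        s = by_model.setdefault(r["model"], {"n": 0, "passed": 0, "loc": 0, "errors": Counter()})
        s["n"] += 1
        s["passed"] += r["passed"]
        s["loc"] += r["loc"]
        if not r["passed"]:
            s["errors"][r["error"]] += 1
    for model, s in sorted(by_model.items()):
        print(f"{model}: pass rate {s['passed'] / s['n']:.2%}, "
              f"mean LOC {s['loc'] / s['n']:.1f}, failures {dict(s['errors'])}")

summarize(results)
```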
{"title":"Insights from Benchmarking Frontier Language Models on Web App Code Generation","authors":"Yi Cui","doi":"arxiv-2409.05177","DOIUrl":"https://doi.org/arxiv-2409.05177","url":null,"abstract":"This paper presents insights from evaluating 16 frontier large language\u0000models (LLMs) on the WebApp1K benchmark, a test suite designed to assess the\u0000ability of LLMs to generate web application code. The results reveal that while\u0000all models possess similar underlying knowledge, their performance is\u0000differentiated by the frequency of mistakes they make. By analyzing lines of\u0000code (LOC) and failure distributions, we find that writing correct code is more\u0000complex than generating incorrect code. Furthermore, prompt engineering shows\u0000limited efficacy in reducing errors beyond specific cases. These findings\u0000suggest that further advancements in coding LLM should emphasize on model\u0000reliability and mistake minimization.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"40 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142222685","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
KModels: Unlocking AI for Business Applications
Pub Date : 2024-09-08 DOI: arxiv-2409.05919
Roy Abitbol (IBM Research Israel), Eyal Cohen (IBM Research Israel), Muhammad Kanaan (IBM Research Israel), Bhavna Agrawal (IBM Research USA), Yingjie Li (IBM Research USA), Anuradha Bhamidipaty (IBM Research USA), Erez Bilgory (IBM Research Israel)
As artificial intelligence (AI) continues to rapidly advance, there is a growing demand to integrate AI capabilities into existing business applications. However, a significant gap exists between the rapid progress in AI and how slowly AI is being embedded into business environments. Deploying well-performing lab models into production settings, especially in on-premise environments, often entails specialized expertise and imposes a heavy burden of model management, creating significant barriers to implementing AI models in real-world applications. KModels leverages proven libraries and platforms (Kubeflow Pipelines, KServe) to streamline AI adoption by supporting both AI developers and consumers. It allows model developers to focus solely on model development and share models as transportable units (Templates), abstracting away complex production deployment concerns. KModels enables AI consumers to eliminate the need for a dedicated data scientist, as the templates encapsulate most data science considerations while providing business-oriented control. This paper presents the architecture of KModels and the key decisions that shape it. We outline KModels' main components as well as its interfaces. Furthermore, we explain how KModels is highly suited for on-premise deployment but can also be used in cloud environments. The efficacy of KModels is demonstrated through the successful deployment of three AI models within an existing Work Order Management system. These models operate in a client's data center and are trained on local data, without data scientist intervention. One model improved the accuracy of Failure Code specification for work orders from 46% to 83%, showcasing the substantial benefit of accessible and localized AI solutions.
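The abstract does not specify the Template format, so the following is only a minimal sketch of how a training step could be packaged as a reusable unit with the public Kubeflow Pipelines v2 SDK, one of the platforms KModels builds on; component names, paths, and parameters are illustrative assumptions, not the KModels API.

```python
from kfp import compiler, dsl

@dsl.component(base_image="python:3.11")
def train_model(data_path: str, model_out: dsl.OutputPath(str)):
    # Placeholder training step: a real template would read local data and
    # serialize a model artifact for a serving layer such as KServe to load.
    with open(model_out, "w") as f:
        f.write(f"model trained on {data_path}")

@dsl.pipeline(name="illustrative-template")
def training_pipeline(data_path: str = "/mnt/data/work-orders.csv"):
    train_model(data_path=data_path)

# Compiling yields a transportable pipeline definition that can be deployed
# on any Kubeflow installation, on-premise or in the cloud.
compiler.Compiler().compile(training_pipeline, package_path="template.yaml")
```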
{"title":"KModels: Unlocking AI for Business Applications","authors":"Roy AbitbolIBM Research Israel, Eyal CohenIBM Research Israel, Muhammad KanaanIBM Research Israel, Bhavna AgrawalIBM Research USA, Yingjie LiIBM Research USA, Anuradha BhamidipatyIBM Research USA, Erez BilgoryIBM Research Israel","doi":"arxiv-2409.05919","DOIUrl":"https://doi.org/arxiv-2409.05919","url":null,"abstract":"As artificial intelligence (AI) continues to rapidly advance, there is a\u0000growing demand to integrate AI capabilities into existing business\u0000applications. However, a significant gap exists between the rapid progress in\u0000AI and how slowly AI is being embedded into business environments. Deploying\u0000well-performing lab models into production settings, especially in on-premise\u0000environments, often entails specialized expertise and imposes a heavy burden of\u0000model management, creating significant barriers to implementing AI models in\u0000real-world applications. KModels leverages proven libraries and platforms (Kubeflow Pipelines, KServe)\u0000to streamline AI adoption by supporting both AI developers and consumers. It\u0000allows model developers to focus solely on model development and share models\u0000as transportable units (Templates), abstracting away complex production\u0000deployment concerns. KModels enables AI consumers to eliminate the need for a\u0000dedicated data scientist, as the templates encapsulate most data science\u0000considerations while providing business-oriented control. This paper presents the architecture of KModels and the key decisions that\u0000shape it. We outline KModels' main components as well as its interfaces.\u0000Furthermore, we explain how KModels is highly suited for on-premise deployment\u0000but can also be used in cloud environments. The efficacy of KModels is demonstrated through the successful deployment of\u0000three AI models within an existing Work Order Management system. These models\u0000operate in a client's data center and are trained on local data, without data\u0000scientist intervention. One model improved the accuracy of Failure Code\u0000specification for work orders from 46% to 83%, showcasing the substantial\u0000benefit of accessible and localized AI solutions.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"29 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142222867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Investigating the Role of Cultural Values in Adopting Large Language Models for Software Engineering
Pub Date : 2024-09-08 DOI: arxiv-2409.05055
Stefano Lambiase, Gemma Catolino, Fabio Palomba, Filomena Ferrucci, Daniel Russo
As a socio-technical activity, software development involves the close interconnection of people and technology. The integration of Large Language Models (LLMs) into this process exemplifies the socio-technical nature of software development. Although LLMs influence the development process, software development remains fundamentally human-centric, necessitating an investigation of the human factors in this adoption. Thus, with this study we explore the factors influencing the adoption of LLMs in software development, focusing on the role of professionals' cultural values. Guided by the Unified Theory of Acceptance and Use of Technology (UTAUT2) and Hofstede's cultural dimensions, we hypothesized that cultural values moderate the relationships within the UTAUT2 framework. Using Partial Least Squares-Structural Equation Modelling and data from 188 software engineers, we found that habit and performance expectancy are the primary drivers of LLM adoption, while cultural values do not significantly moderate this process. These findings suggest that, by highlighting how LLMs can boost performance and efficiency, organizations can encourage their use, no matter the cultural differences. Practical steps include offering training programs to demonstrate LLM benefits, creating a supportive environment for regular use, and continuously tracking and sharing performance improvements from using LLMs.
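For readers unfamiliar with moderation analysis, a rough regression-based analogue of the hypothesis test is sketched below. It uses ordinary least squares with an interaction term rather than the Partial Least Squares-Structural Equation Modelling the authors apply, and all variable names and values are invented for illustration.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical survey scores; columns stand in for UTAUT2 constructs and one
# Hofstede dimension, not the authors' actual instrument.
df = pd.DataFrame({
    "adoption":               [3.2, 4.1, 2.8, 4.5, 3.9, 2.5, 4.0, 3.1],
    "habit":                  [2.9, 4.0, 2.5, 4.6, 3.8, 2.2, 4.1, 3.0],
    "performance_expectancy": [3.5, 4.2, 3.0, 4.4, 4.0, 2.8, 4.3, 3.2],
    "uncertainty_avoidance":  [0.4, 0.7, 0.3, 0.8, 0.5, 0.6, 0.2, 0.9],
})

# Moderation is commonly approximated with an interaction term: a significant
# habit:uncertainty_avoidance coefficient would indicate that the cultural
# dimension moderates the habit -> adoption relationship.
model = smf.ols(
    "adoption ~ habit * uncertainty_avoidance + performance_expectancy",
    data=df,
).fit()
print(model.summary())
```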
{"title":"Investigating the Role of Cultural Values in Adopting Large Language Models for Software Engineering","authors":"Stefano Lambiase, Gemma Catolino, Fabio Palomba, Filomena Ferrucci, Daniel Russo","doi":"arxiv-2409.05055","DOIUrl":"https://doi.org/arxiv-2409.05055","url":null,"abstract":"As a socio-technical activity, software development involves the close\u0000interconnection of people and technology. The integration of Large Language\u0000Models (LLMs) into this process exemplifies the socio-technical nature of\u0000software development. Although LLMs influence the development process, software\u0000development remains fundamentally human-centric, necessitating an investigation\u0000of the human factors in this adoption. Thus, with this study we explore the\u0000factors influencing the adoption of LLMs in software development, focusing on\u0000the role of professionals' cultural values. Guided by the Unified Theory of\u0000Acceptance and Use of Technology (UTAUT2) and Hofstede's cultural dimensions,\u0000we hypothesized that cultural values moderate the relationships within the\u0000UTAUT2 framework. Using Partial Least Squares-Structural Equation Modelling and\u0000data from 188 software engineers, we found that habit and performance\u0000expectancy are the primary drivers of LLM adoption, while cultural values do\u0000not significantly moderate this process. These findings suggest that, by\u0000highlighting how LLMs can boost performance and efficiency, organizations can\u0000encourage their use, no matter the cultural differences. Practical steps\u0000include offering training programs to demonstrate LLM benefits, creating a\u0000supportive environment for regular use, and continuously tracking and sharing\u0000performance improvements from using LLMs.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"102 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142222684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
OSS License Identification at Scale: A Comprehensive Dataset Using World of Code
Pub Date : 2024-09-07 DOI: arxiv-2409.04824
Mahmoud Jahanshahi, David Reid, Adam McDaniel, Audris Mockus
The proliferation of open source software (OSS) has led to a complex landscape of licensing practices, making accurate license identification crucial for legal and compliance purposes. This study presents a comprehensive analysis of OSS licenses using the World of Code (WoC) infrastructure. We employ an exhaustive approach, scanning all files containing "license" in their filepath, and apply the winnowing algorithm for robust text matching. Our method identifies and matches over 5.5 million distinct license blobs across millions of OSS projects, creating a detailed project-to-license (P2L) map. We verify the accuracy of our approach through stratified sampling and manual review, achieving a final accuracy of 92.08%, with precision of 87.14%, recall of 95.45%, and an F1 score of 91.11%. This work enhances the understanding of OSS licensing practices and provides a valuable resource for developers, researchers, and legal professionals. Future work will expand the scope of license detection to include code files and references to licenses in project documentation.
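Winnowing is a standard fingerprinting technique: hash every k-gram, then keep the minimum hash in each sliding window of k-gram hashes. The sketch below illustrates the idea only; the normalization, hash function, and the k and w values are assumptions, not the study's configuration.

```python
import hashlib

def kgram_hashes(text: str, k: int = 5):
    """Hash every k-gram of a crudely normalized (lowercased, de-spaced) text."""
    text = "".join(text.lower().split())
    return [
        int(hashlib.sha1(text[i:i + k].encode()).hexdigest()[:8], 16)
        for i in range(len(text) - k + 1)
    ]

def winnow(hashes, w: int = 4):
    """Keep the minimum hash of each window of w consecutive k-gram hashes
    (rightmost minimum on ties), recording each selected (hash, position) once."""
    fingerprints = set()
    for i in range(max(0, len(hashes) - w + 1)):
        window = hashes[i:i + w]
        j = min(range(w), key=lambda x: (window[x], -x))
        fingerprints.add((window[j], i + j))
    return fingerprints

def similarity(a: str, b: str) -> float:
    """Jaccard overlap of fingerprints, a simple robust-matching score."""
    fa = {h for h, _ in winnow(kgram_hashes(a))}
    fb = {h for h, _ in winnow(kgram_hashes(b))}
    return len(fa & fb) / max(1, len(fa | fb))

print(similarity("MIT License, Copyright (c) 2024",
                 "The  MIT   license, copyright (c) 2024"))
```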
{"title":"OSS License Identification at Scale: A Comprehensive Dataset Using World of Code","authors":"Mahmoud Jahanshahi, David Reid, Adam McDaniel, Audris Mockus","doi":"arxiv-2409.04824","DOIUrl":"https://doi.org/arxiv-2409.04824","url":null,"abstract":"The proliferation of open source software (OSS) has led to a complex\u0000landscape of licensing practices, making accurate license identification\u0000crucial for legal and compliance purposes. This study presents a comprehensive\u0000analysis of OSS licenses using the World of Code (WoC) infrastructure. We\u0000employ an exhaustive approach, scanning all files containing ``license'' in\u0000their filepath, and apply the winnowing algorithm for robust text matching. Our\u0000method identifies and matches over 5.5 million distinct license blobs across\u0000millions of OSS projects, creating a detailed project-to-license (P2L) map. We\u0000verify the accuracy of our approach through stratified sampling and manual\u0000review, achieving a final accuracy of 92.08%, with precision of 87.14%, recall\u0000of 95.45%, and an F1 score of 91.11%. This work enhances the understanding of\u0000OSS licensing practices and provides a valuable resource for developers,\u0000researchers, and legal professionals. Future work will expand the scope of\u0000license detection to include code files and references to licenses in project\u0000documentation.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"10 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142222691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
MILE: A Mutation Testing Framework of In-Context Learning Systems
Pub Date : 2024-09-07 DOI: arxiv-2409.04831
Zeming Wei, Yihao Zhang, Meng Sun
In-context Learning (ICL) has achieved notable success in the applications of large language models (LLMs). By adding only a few input-output pairs that demonstrate a new task, the LLM can efficiently learn the task during inference without modifying the model parameters. Such mysterious ability of LLMs has attracted great research interest in understanding, formatting, and improving the in-context demonstrations, while still suffering from drawbacks like black-box mechanisms and sensitivity to the selection of examples. In this work, inspired by the foundations of adopting testing techniques in machine learning (ML) systems, we propose a mutation testing framework designed to characterize the quality and effectiveness of test data for ICL systems. First, we propose several mutation operators specialized for ICL demonstrations, as well as corresponding mutation scores for ICL test sets. With comprehensive experiments, we showcase the effectiveness of our framework in evaluating the reliability and quality of ICL test suites. Our code is available at https://github.com/weizeming/MILE.
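As a rough illustration of what demonstration-level mutation operators and a mutation score could look like, consider the sketch below. The three operators and the scoring rule are invented for exposition and are not the operators defined in the paper; `run_icl` stands for any caller-supplied function that queries an LLM with the given demonstrations and query.

```python
import random

def drop_demo(demos):
    """Remove one in-context demonstration."""
    i = random.randrange(len(demos))
    return demos[:i] + demos[i + 1:]

def shuffle_demos(demos):
    """Permute the demonstration order."""
    out = list(demos)
    random.shuffle(out)
    return out

def corrupt_label(demos):
    """Replace one demonstration's output with another's (label noise)."""
    i, j = random.sample(range(len(demos)), 2)
    out = list(demos)
    out[i] = (out[i][0], out[j][1])
    return out

def mutation_score(test_suite, run_icl, operators, trials=10):
    """Fraction of mutants whose verdict differs from the unmutated baseline,
    i.e. mutants that the test suite 'kills'."""
    killed = total = 0
    for case in test_suite:
        baseline = run_icl(case["demos"], case["query"]) == case["expected"]
        for op in operators:
            for _ in range(trials):
                verdict = run_icl(op(case["demos"]), case["query"]) == case["expected"]
                total += 1
                killed += verdict != baseline
    return killed / max(1, total)
```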
{"title":"MILE: A Mutation Testing Framework of In-Context Learning Systems","authors":"Zeming Wei, Yihao Zhang, Meng Sun","doi":"arxiv-2409.04831","DOIUrl":"https://doi.org/arxiv-2409.04831","url":null,"abstract":"In-context Learning (ICL) has achieved notable success in the applications of\u0000large language models (LLMs). By adding only a few input-output pairs that\u0000demonstrate a new task, the LLM can efficiently learn the task during inference\u0000without modifying the model parameters. Such mysterious ability of LLMs has\u0000attracted great research interests in understanding, formatting, and improving\u0000the in-context demonstrations, while still suffering from drawbacks like\u0000black-box mechanisms and sensitivity against the selection of examples. In this\u0000work, inspired by the foundations of adopting testing techniques in machine\u0000learning (ML) systems, we propose a mutation testing framework designed to\u0000characterize the quality and effectiveness of test data for ICL systems. First,\u0000we propose several mutation operators specialized for ICL demonstrations, as\u0000well as corresponding mutation scores for ICL test sets. With comprehensive\u0000experiments, we showcase the effectiveness of our framework in evaluating the\u0000reliability and quality of ICL test suites. Our code is available at\u0000https://github.com/weizeming/MILE.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"8 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142222689","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Reducing Events to Augment Log-based Anomaly Detection Models: An Empirical Study
Pub Date : 2024-09-07 DOI: arxiv-2409.04834
Lingzhe Zhang, Tong Jia, Kangjin Wang, Mengxi Jia, Yang Yong, Ying Li
As software systems grow increasingly intricate, the precise detection of anomalies has become both essential and challenging. Current log-based anomaly detection methods depend heavily on vast amounts of log data, leading to inefficient inference and potential misguidance by noise logs. However, the quantitative effects of log reduction on the effectiveness of anomaly detection remain unexplored. Therefore, we first conduct a comprehensive study on six distinct models spanning three datasets. Through this study, the impact of log quantity and its effectiveness in representing anomalies is quantified, uncovering three distinctive log event types that differently influence model performance. Drawing from these insights, we propose LogCleaner: an efficient methodology for the automatic reduction of log events in the context of anomaly detection. Serving as middleware between software systems and models, LogCleaner continuously updates and filters anti-events and duplicative events in the raw generated logs. Experimental outcomes highlight LogCleaner's capability to reduce over 70% of log events in anomaly detection, accelerating the model's inference speed by approximately 300%, and universally improving the performance of models for anomaly detection.
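The abstract does not detail how anti-events and duplicative events are identified, so the following is only a loose sketch of the middleware idea: drop templates whose frequency profile barely differs between normal and anomalous history, and collapse repeated templates. The threshold and the frequency heuristic are assumptions, not the paper's algorithm.

```python
def reduce_events(events, normal_counts, anomalous_counts, tolerance=0.1):
    """Filter a raw log-event stream before it reaches an anomaly detector.

    Two illustrative reductions:
    - templates whose frequencies are nearly identical in normal and anomalous
      history carry little signal and are dropped;
    - repeats of the same template that become adjacent after filtering are
      collapsed into a single occurrence.
    """
    uninformative = {
        t for t in set(normal_counts) & set(anomalous_counts)
        if abs(normal_counts[t] - anomalous_counts[t])
        / max(normal_counts[t], anomalous_counts[t]) < tolerance
    }
    reduced, prev = [], None
    for template in events:
        if template in uninformative or template == prev:
            continue
        reduced.append(template)
        prev = template
    return reduced

stream = ["heartbeat", "heartbeat", "disk_error", "heartbeat", "disk_error"]
print(reduce_events(stream,
                    normal_counts={"heartbeat": 100, "disk_error": 1},
                    anomalous_counts={"heartbeat": 98, "disk_error": 40}))
```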
{"title":"Reducing Events to Augment Log-based Anomaly Detection Models: An Empirical Study","authors":"Lingzhe Zhang, Tong Jia, Kangjin Wang, Mengxi Jia, Yang Yong, Ying Li","doi":"arxiv-2409.04834","DOIUrl":"https://doi.org/arxiv-2409.04834","url":null,"abstract":"As software systems grow increasingly intricate, the precise detection of\u0000anomalies have become both essential and challenging. Current log-based anomaly\u0000detection methods depend heavily on vast amounts of log data leading to\u0000inefficient inference and potential misguidance by noise logs. However, the\u0000quantitative effects of log reduction on the effectiveness of anomaly detection\u0000remain unexplored. Therefore, we first conduct a comprehensive study on six\u0000distinct models spanning three datasets. Through the study, the impact of log\u0000quantity and their effectiveness in representing anomalies is qualifies,\u0000uncovering three distinctive log event types that differently influence model\u0000performance. Drawing from these insights, we propose LogCleaner: an efficient\u0000methodology for the automatic reduction of log events in the context of anomaly\u0000detection. Serving as middleware between software systems and models,\u0000LogCleaner continuously updates and filters anti-events and duplicative-events\u0000in the raw generated logs. Experimental outcomes highlight LogCleaner's\u0000capability to reduce over 70% of log events in anomaly detection, accelerating\u0000the model's inference speed by approximately 300%, and universally improving\u0000the performance of models for anomaly detection.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"27 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142222688","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Beyond Dependencies: The Role of Copy-Based Reuse in Open Source Software Development
Pub Date : 2024-09-07 DOI: arxiv-2409.04830
Mahmoud Jahanshahi, David Reid, Audris Mockus
In Open Source Software, resources of any project are open for reuse by introducing dependencies or copying the resource itself. In contrast to dependency-based reuse, the infrastructure to systematically support copy-based reuse appears to be entirely missing. Our aim is to enable future research and tool development to increase efficiency and reduce the risks of copy-based reuse. We seek a better understanding of such reuse by measuring its prevalence and identifying factors affecting the propensity to reuse. To identify reused artifacts and trace their origins, our method exploits World of Code infrastructure. We begin with a set of theory-derived factors related to the propensity to reuse, sample instances of different reuse types, and survey developers to better understand their intentions. Our results indicate that copy-based reuse is common, with many developers being aware of it when writing code. The propensity for a file to be reused varies greatly among languages and between source code and binary files, consistently decreasing over time. Files introduced by popular projects are more likely to be reused, but at least half of reused resources originate from "small" and "medium" projects. Developers had various reasons for reuse but were generally positive about using a package manager.
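To make the notion of tracing copy-based reuse concrete, here is a minimal sketch of the underlying identity test: two files are the same artifact when they hash to the same git blob id. The sample projects and file contents are invented; an analysis at the paper's scale would stream blob-to-project maps from World of Code rather than an in-memory list.

```python
import hashlib
from collections import defaultdict

def git_blob_sha1(content: bytes) -> str:
    """Content-address a file the way git does: sha1 over 'blob <len>\\0' + bytes."""
    return hashlib.sha1(f"blob {len(content)}\0".encode() + content).hexdigest()

def copied_blobs(files):
    """Group (project, path) locations that share byte-identical file content."""
    index = defaultdict(list)
    for project, path, content in files:
        index[git_blob_sha1(content)].append((project, path))
    return {blob: locs for blob, locs in index.items() if len(locs) > 1}

# Illustrative input only.
files = [
    ("proj-a", "src/util.c", b"int add(int a, int b) { return a + b; }\n"),
    ("proj-b", "lib/util.c", b"int add(int a, int b) { return a + b; }\n"),
    ("proj-c", "main.c",     b"int main(void) { return 0; }\n"),
]
print(copied_blobs(files))
```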
{"title":"Beyond Dependencies: The Role of Copy-Based Reuse in Open Source Software Development","authors":"Mahmoud Jahanshahi, David Reid, Audris Mockus","doi":"arxiv-2409.04830","DOIUrl":"https://doi.org/arxiv-2409.04830","url":null,"abstract":"In Open Source Software, resources of any project are open for reuse by\u0000introducing dependencies or copying the resource itself. In contrast to\u0000dependency-based reuse, the infrastructure to systematically support copy-based\u0000reuse appears to be entirely missing. Our aim is to enable future research and\u0000tool development to increase efficiency and reduce the risks of copy-based\u0000reuse. We seek a better understanding of such reuse by measuring its prevalence\u0000and identifying factors affecting the propensity to reuse. To identify reused\u0000artifacts and trace their origins, our method exploits World of Code\u0000infrastructure. We begin with a set of theory-derived factors related to the\u0000propensity to reuse, sample instances of different reuse types, and survey\u0000developers to better understand their intentions. Our results indicate that\u0000copy-based reuse is common, with many developers being aware of it when writing\u0000code. The propensity for a file to be reused varies greatly among languages and\u0000between source code and binary files, consistently decreasing over time. Files\u0000introduced by popular projects are more likely to be reused, but at least half\u0000of reused resources originate from ``small'' and ``medium'' projects.\u0000Developers had various reasons for reuse but were generally positive about\u0000using a package manager.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142222690","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Exploring User Privacy Awareness on GitHub: An Empirical Study
Pub Date : 2024-09-06 DOI: arxiv-2409.04048
Costanza Alfieri, Juri Di Rocco, Phuong T. Nguyen, Paola Inverardi
GitHub provides developers with a practical way to distribute source code and collaboratively work on common projects. To enhance account security and privacy, GitHub allows its users to manage access permissions, review audit logs, and enable two-factor authentication. However, despite the endless effort, the platform still faces various issues related to the privacy of its users. This paper presents an empirical study delving into the GitHub ecosystem. Our focus is on investigating the utilization of privacy settings on the platform and identifying various types of sensitive information disclosed by users. Leveraging a dataset comprising 6,132 developers, we report and analyze their activities by means of comments on pull requests. Our findings indicate an active engagement by users with the available privacy settings on GitHub. Notably, we observe the disclosure of different forms of private information within pull request comments. This observation has prompted our exploration into sensitivity detection using a large language model and BERT, to pave the way for a personalized privacy assistant. Our work provides insights into the utilization of existing privacy protection tools, such as privacy settings, along with their inherent limitations. Essentially, we aim to advance research in this field by providing both the motivation for creating such privacy protection tools and a proposed methodology for personalizing them.
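A minimal sketch of flagging pull-request comments with a Hugging Face sequence-classification pipeline is shown below. The checkpoint name is only a runnable stand-in (the study's own fine-tuned sensitivity model is not published in the abstract, and a real setup would fine-tune BERT on sensitive / non-sensitive labels), and the example comments are invented.

```python
from transformers import pipeline

# Stand-in checkpoint: substitute a classifier fine-tuned on sensitivity labels.
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

comments = [
    "LGTM, merging now.",
    "Ping me at alice@example.com; the build only fails on my home machine.",
]
for comment, pred in zip(comments, classifier(comments)):
    print(f"{pred['label']:>8}  {pred['score']:.2f}  {comment}")
```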
{"title":"Exploring User Privacy Awareness on GitHub: An Empirical Study","authors":"Costanza Alfieri, Juri Di Rocco, Phuong T. Nguyen, Paola Inverardi","doi":"arxiv-2409.04048","DOIUrl":"https://doi.org/arxiv-2409.04048","url":null,"abstract":"GitHub provides developers with a practical way to distribute source code and\u0000collaboratively work on common projects. To enhance account security and\u0000privacy, GitHub allows its users to manage access permissions, review audit\u0000logs, and enable two-factor authentication. However, despite the endless\u0000effort, the platform still faces various issues related to the privacy of its\u0000users. This paper presents an empirical study delving into the GitHub\u0000ecosystem. Our focus is on investigating the utilization of privacy settings on\u0000the platform and identifying various types of sensitive information disclosed\u0000by users. Leveraging a dataset comprising 6,132 developers, we report and\u0000analyze their activities by means of comments on pull requests. Our findings\u0000indicate an active engagement by users with the available privacy settings on\u0000GitHub. Notably, we observe the disclosure of different forms of private\u0000information within pull request comments. This observation has prompted our\u0000exploration into sensitivity detection using a large language model and BERT,\u0000to pave the way for a personalized privacy assistant. Our work provides\u0000insights into the utilization of existing privacy protection tools, such as\u0000privacy settings, along with their inherent limitations. Essentially, we aim to\u0000advance research in this field by providing both the motivation for creating\u0000such privacy protection tools and a proposed methodology for personalizing\u0000them.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"40 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142227614","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Detecting Buggy Contracts via Smart Testing
Pub Date : 2024-09-06 DOI: arxiv-2409.04597
Sally Junsong Wang, Jianan Yao, Kexin Pei, Hidedaki Takahashi, Junfeng Yang
Smart contracts are susceptible to critical vulnerabilities. Hybrid dynamic analyses, such as concolic-execution-assisted fuzzing and foundation-model-assisted fuzzing, have recently emerged as highly effective testing techniques for smart contract bug detection. This hybrid approach has shown initial promise on real-world benchmarks, but it still suffers from low scalability in finding deep bugs buried in complex code patterns. We observe that performance bottlenecks of existing dynamic analyses and model hallucination are two main factors limiting the scalability of this hybrid approach in finding deep bugs. To overcome these challenges, we design an interactive, self-deciding foundation-model-based system, called SmartSys, to support hybrid smart contract dynamic analyses. The key idea is to teach foundation models about the performance bottlenecks of different dynamic analysis techniques, making it possible to forecast the right technique and generate effective fuzz targets that can reach deep, hidden bugs. To prune hallucinated, incorrect fuzz targets, SmartSys feeds foundation models with feedback from dynamic analysis during compilation and at runtime. The interesting results of SmartSys include: i) discovering a smart contract protocol vulnerability that has escaped eleven tools and survived multiple audits for over a year; ii) improving coverage by up to 14.3% on real-world benchmarks compared to the baselines.
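The feedback loop the abstract describes might, in skeletal form, look like the sketch below. `llm`, `compile_fn`, and `run_fn` are caller-supplied placeholders (nothing here is the SmartSys API), and the stopping criteria are invented for illustration.

```python
def refine_fuzz_target(llm, contract_source, compile_fn, run_fn, max_rounds=5):
    """Generate-and-refine loop: compiler diagnostics and runtime coverage
    reports are fed back into the model until a target compiles and exercises
    previously unreached code."""
    feedback = ""
    for _ in range(max_rounds):
        prompt = f"Write a fuzz target for this contract:\n{contract_source}\n{feedback}"
        target = llm(prompt)
        compiled, diagnostics = compile_fn(target)
        if not compiled:
            feedback = f"The previous target failed to compile:\n{diagnostics}"
            continue
        new_paths, report = run_fn(target)
        if new_paths:
            return target
        feedback = f"The previous target compiled but reached no new paths:\n{report}"
    return None
```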
{"title":"Detecting Buggy Contracts via Smart Testing","authors":"Sally Junsong Wang, Jianan Yao, Kexin Pei, Hidedaki Takahashi, Junfeng Yang","doi":"arxiv-2409.04597","DOIUrl":"https://doi.org/arxiv-2409.04597","url":null,"abstract":"Smart contracts are susceptible to critical vulnerabilities. Hybrid dynamic\u0000analyses, such as concolic execution assisted fuzzing and foundation model\u0000assisted fuzzing, have emerged as highly effective testing techniques for smart\u0000contract bug detection recently. This hybrid approach has shown initial promise\u0000in real-world benchmarks, but it still suffers from low scalability to find\u0000deep bugs buried in complex code patterns. We observe that performance\u0000bottlenecks of existing dynamic analyses and model hallucination are two main\u0000factors limiting the scalability of this hybrid approach in finding deep bugs. To overcome the challenges, we design an interactive, self-deciding\u0000foundation model based system, called SmartSys, to support hybrid smart\u0000contract dynamic analyses. The key idea is to teach foundation models about\u0000performance bottlenecks of different dynamic analysis techniques, making it\u0000possible to forecast the right technique and generates effective fuzz targets\u0000that can reach deep, hidden bugs. To prune hallucinated, incorrect fuzz\u0000targets, SmartSys feeds foundation models with feedback from dynamic analysis\u0000during compilation and at runtime. The interesting results of SmartSys include: i) discovering a smart contract\u0000protocol vulnerability that has escaped eleven tools and survived multiple\u0000audits for over a year; ii) improving coverage by up to 14.3% on real-world\u0000benchmarks compared to the baselines.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"34 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142222695","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Can OpenSource beat ChatGPT? -- A Comparative Study of Large Language Models for Text-to-Code Generation
Pub Date : 2024-09-06 DOI: arxiv-2409.04164
Luis Mayer, Christian Heumann, Matthias Aßenmacher
In recent years, large language models (LLMs) have emerged as powerful tools with potential applications in various fields, including software engineering. Within the scope of this research, we evaluate five different state-of-the-art LLMs - Bard, BingChat, ChatGPT, Llama2, and Code Llama - concerning their capabilities for text-to-code generation. In an empirical study, we feed prompts with textual descriptions of coding problems sourced from the programming website LeetCode to the models, with the task of creating solutions in Python. Subsequently, the quality of the generated outputs is assessed using the testing functionalities of LeetCode. The results indicate large differences in performance between the investigated models. ChatGPT handles these typical programming challenges by far the most effectively, surpassing even code-specialized models like Code Llama. To gain further insights, we measure the runtime as well as the memory usage of the generated outputs and compare them to the other code submissions on LeetCode. A detailed error analysis, encompassing a comparison of the differences concerning correct indentation and form of the generated code, as well as an assignment of the incorrectly solved tasks to certain error categories, allows us to obtain a more nuanced picture of the results and potential for improvement. The results also show a clear pattern of increasingly incorrect code when the models face a lot of context in the form of longer prompts.
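Measuring runtime and memory of generated solutions can be approximated with a small harness like the one below. The path `solution.py` and the stdin payload are placeholders, and the `resource`-based peak-RSS accounting is Unix-only and much coarser than LeetCode's own instrumentation.

```python
import resource
import subprocess
import time

def run_candidate(path: str, stdin_data: str = "", timeout: float = 10.0):
    """Run a generated solution in a subprocess; report wall time and the peak
    resident set size of child processes (kilobytes on Linux, Unix-only)."""
    start = time.perf_counter()
    proc = subprocess.run(
        ["python", path],
        input=stdin_data, capture_output=True, text=True, timeout=timeout,
    )
    elapsed = time.perf_counter() - start
    peak_rss_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return {
        "ok": proc.returncode == 0,
        "stdout": proc.stdout,
        "seconds": round(elapsed, 3),
        "max_rss_kb": peak_rss_kb,
    }

print(run_candidate("solution.py", stdin_data="3\n1 2 3\n"))
```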
{"title":"Can OpenSource beat ChatGPT? -- A Comparative Study of Large Language Models for Text-to-Code Generation","authors":"Luis Mayer, Christian Heumann, Matthias Aßenmacher","doi":"arxiv-2409.04164","DOIUrl":"https://doi.org/arxiv-2409.04164","url":null,"abstract":"In recent years, large language models (LLMs) have emerged as powerful tools\u0000with potential applications in various fields, including software engineering.\u0000Within the scope of this research, we evaluate five different state-of-the-art\u0000LLMs - Bard, BingChat, ChatGPT, Llama2, and Code Llama - concerning their\u0000capabilities for text-to-code generation. In an empirical study, we feed\u0000prompts with textual descriptions of coding problems sourced from the\u0000programming website LeetCode to the models with the task of creating solutions\u0000in Python. Subsequently, the quality of the generated outputs is assessed using\u0000the testing functionalities of LeetCode. The results indicate large differences\u0000in performance between the investigated models. ChatGPT can handle these\u0000typical programming challenges by far the most effectively, surpassing even\u0000code-specialized models like Code Llama. To gain further insights, we measure\u0000the runtime as well as the memory usage of the generated outputs and compared\u0000them to the other code submissions on Leetcode. A detailed error analysis,\u0000encompassing a comparison of the differences concerning correct indentation and\u0000form of the generated code as well as an assignment of the incorrectly solved\u0000tasks to certain error categories allows us to obtain a more nuanced picture of\u0000the results and potential for improvement. The results also show a clear\u0000pattern of increasingly incorrect produced code when the models are facing a\u0000lot of context in the form of longer prompts.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"438 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142222742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0