
Latest articles in ACM Transactions on Software Engineering and Methodology

Beyond Fidelity: Explaining Vulnerability Localization of Learning-based Detectors
IF 4.4 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING · Pub Date: 2024-01-31 · DOI: 10.1145/3641543
Baijun Cheng, Mingsheng Zhao, Kailong Wang, Meizhen Wang, Guangdong Bai, Ruitao Feng, Yao Guo, Lei Ma, Haoyu Wang

Abstract: Vulnerability detectors based on deep learning (DL) models have proven their effectiveness in recent years. However, the opacity of these detectors' decision-making process makes their predictions difficult for security analysts to comprehend. To address this, various explanation approaches have been proposed that explain predictions by highlighting important features; such approaches have been demonstrated to be effective in other domains such as computer vision and natural language processing. Unfortunately, an in-depth evaluation of whether these explanation approaches capture vulnerability-critical features, such as fine-grained vulnerability-related code lines, remains lacking. In this study, we first evaluate the performance of ten explanation approaches on vulnerability detectors based on graph and sequence representations, measured by two quantitative metrics: fidelity and vulnerability line coverage rate. Our results show that fidelity alone is not sufficient for evaluating these approaches, as it fluctuates significantly across datasets and detectors. We subsequently check the precision of the vulnerability-related code lines reported by the explanation approaches and find that all of them perform poorly on this task. This can be attributed to the inefficiency of explainers in selecting important features and to the irrelevant artifacts learned by DL-based detectors.
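As a concrete reading of the two metrics named in the abstract, the sketch below is an illustrative simplification (not the paper's implementation; all function names are ours): fidelity is taken as the drop in the detector's vulnerability score after masking the explainer's top-ranked features, and line coverage as the fraction of ground-truth vulnerable lines among the explainer's top-k reported lines.

```python
def fidelity(prob_full: float, prob_masked: float) -> float:
    """Drop in the detector's predicted vulnerability probability after
    masking the explainer's top-ranked features; a larger drop suggests
    the explanation captured features the model actually relied on."""
    return prob_full - prob_masked

def line_coverage(ranked_lines: list, truth_lines: list, k: int) -> float:
    """Fraction of ground-truth vulnerable lines that appear among the
    explainer's top-k reported lines."""
    hits = set(ranked_lines[:k]) & set(truth_lines)
    return len(hits) / len(truth_lines)

# Example: the explainer ranks lines 12, 7, 3, 40; lines 7 and 9 are the
# truly vulnerable ones. The top-3 contains line 7 only, so coverage is 0.5.
```

The study's point is that a high fidelity score can coexist with a low line coverage rate, which is why fidelity alone is an unreliable yardstick.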

Citations: 0
An Empirical Analysis of Issue Templates Usage in Large-Scale Projects on GitHub
IF 4.4 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING · Pub Date: 2024-01-31 · DOI: 10.1145/3643673
Emre Sülün, Metehan Saçakçı, Eray Tüzün

GitHub Issues is a widely used issue tracking tool in open-source software projects. Originally designed with broad flexibility, its lack of standardization led to incomplete issue reports, impeding software development and maintenance efficiency. To counteract this, GitHub introduced issue templates in 2016, which rapidly became popular. Our study assesses the current use and evolution of these templates in large-scale open-source projects and their impact on issue tracking metrics, including resolution time, number of reopens, and number of issue comments. Employing a comprehensive analysis of 350 templates from 100 projects, we also evaluated over 1.9 million issues for template conformity and impact. Additionally, we solicited insights from open-source software maintainers through a survey. Our findings highlight issue templates’ extensive usage in 99 of the 100 surveyed projects, with a growing preference for YAML-based templates, a more structured template variant. Projects with a template exhibited markedly reduced resolution time (381.02 days to 103.18 days) and reduced issue comment count (4.95 to 4.32) compared to those without. The use of YAML-based templates further significantly decreased resolution time, the number of reopenings, and the discussion extent. Thus, our research underscores issue templates’ positive impact on large-scale open-source projects, offering recommendations for improved effectiveness.
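For context, the "YAML-based templates" discussed above are GitHub issue forms, which replace free-form Markdown templates with structured, validated fields. A minimal sketch of the format (file path, field names, and labels chosen for illustration):

```yaml
# .github/ISSUE_TEMPLATE/bug_report.yml
name: Bug report
description: File a structured bug report
labels: ["bug"]
body:
  - type: input
    id: version
    attributes:
      label: Version
      placeholder: e.g. 2.3.1
    validations:
      required: true   # the form cannot be submitted without this field
  - type: textarea
    id: what-happened
    attributes:
      label: What happened?
      description: Also tell us what you expected to see.
    validations:
      required: true
```

Because required fields are enforced at submission time, issue forms of this kind are one plausible mechanism behind the reduced resolution times reported above.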

Citations: 0
KADEL: Knowledge-Aware Denoising Learning for Commit Message Generation
IF 4.4 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING · Pub Date: 2024-01-29 · DOI: 10.1145/3643675
Wei Tao, Yucheng Zhou, Yanlin Wang, Hongyu Zhang, Haofen Wang, Wenqiang Zhang

Commit messages are natural language descriptions of code changes, which are important for software evolution such as code understanding and maintenance. However, previous methods are trained on the entire dataset without considering the fact that a portion of commit messages adhere to good practice (i.e., good-practice commits), while the rest do not. On the basis of our empirical study, we discover that training on good-practice commits significantly contributes to the commit message generation. Motivated by this finding, we propose a novel knowledge-aware denoising learning method called KADEL. Considering that good-practice commits constitute only a small proportion of the dataset, we align the remaining training samples with these good-practice commits. To achieve this, we propose a model that learns the commit knowledge by training on good-practice commits. This knowledge model enables supplementing more information for training samples that do not conform to good practice. However, since the supplementary information may contain noise or prediction errors, we propose a dynamic denoising training method. This method composes a distribution-aware confidence function and a dynamic distribution list, which enhances the effectiveness of the training process. Experimental results on the whole MCMD dataset demonstrate that our method overall achieves state-of-the-art performance compared with previous methods.
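KADEL's distribution-aware confidence function and dynamic distribution list are more involved than can be shown here; the toy sketch below (entirely our assumption, not the paper's formula) only illustrates the basic denoising idea of down-weighting a supplemented training sample that the knowledge model considers unlikely.

```python
import math

def confidence_weight(token_logprobs: list, threshold: float) -> float:
    """Weight for one supplemented sample: full weight if its average
    token log-probability under the knowledge model clears a threshold,
    exponentially decayed weight otherwise (a likely-noisy sample).
    Both the threshold and the decay are illustrative choices."""
    avg = sum(token_logprobs) / len(token_logprobs)
    return 1.0 if avg >= threshold else math.exp(avg - threshold)
```

In a training loop, such a weight would multiply the per-sample loss, so clean samples dominate the gradient while noisy supplements still contribute a little.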

Citations: 0
Analyzing and Detecting Information Types of Developer Live Chat Threads
IF 4.4 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING · Pub Date: 2024-01-29 · DOI: 10.1145/3643677
Xiuwei Shang, Shuai Zhang, Yitong Zhang, Shikai Guo, Yulong Li, Rong Chen, Hui Li, Xiaochen Li, He Jiang

Online chatrooms serve as vital platforms for information exchange among software developers. With multiple developers engaged in rapid communication and diverse conversation topics, the resulting chat messages often manifest complexity and lack structure. To enhance the efficiency of extracting information from chat threads, automatic mining techniques are introduced for thread classification. However, previous approaches still grapple with unsatisfactory classification accuracy, due to two primary challenges: they struggle to adequately capture long-distance dependencies within chat threads and to address category imbalance in labeled datasets. To surmount these challenges, we present a topic classification approach for chat information types named EAEChat. Specifically, EAEChat comprises three core components: the text feature encoding component captures contextual text features using a multi-head self-attention-based text feature encoder, with a siamese network employed to mitigate overfitting caused by limited data; the data augmentation component expands underrepresented categories in the training dataset using a technique tailored to developer chat messages, effectively tackling the challenge of imbalanced category distribution; and the non-text feature encoding component employs a feature fusion model to integrate deep text features with manually extracted non-text features. Evaluation across three real-world projects demonstrates that EAEChat achieves an average precision, recall, and F1-score of 0.653, 0.651, and 0.644, respectively, marking a significant 7.60% improvement over state-of-the-art approaches. These findings confirm the effectiveness of our method in proficiently classifying developer chat messages in online chatrooms.
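The feature-fusion step described above can be pictured as simple late fusion. The sketch below is illustrative only; EAEChat's fusion model is a trained network, and these helper names are hypothetical.

```python
def fuse_features(text_vec: list, non_text_vec: list) -> list:
    """Late fusion by concatenation: deep text features followed by
    hand-crafted non-text features (e.g. message count, thread length)."""
    return list(text_vec) + list(non_text_vec)

def linear_score(fused: list, weights: list, bias: float = 0.0) -> float:
    """A linear scorer standing in for the learned classifier head."""
    return bias + sum(f * w for f, w in zip(fused, weights))
```

The benefit of fusing at the vector level is that structured thread metadata can influence the decision even when the message text alone is ambiguous.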

Citations: 0
Refining ChatGPT-Generated Code: Characterizing and Mitigating Code Quality Issues
IF 4.4 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING · Pub Date: 2024-01-27 · DOI: 10.1145/3643674
Yue Liu, Thanh Le-Cong, Ratnadira Widyasari, Chakkrit Tantithamthavorn, Li Li, Xuan-Bach D. Le, David Lo

Since its introduction in November 2022, ChatGPT has rapidly gained popularity due to its remarkable ability in language understanding and human-like responses. ChatGPT, based on GPT-3.5 architecture, has shown great promise for revolutionizing various research fields, including code generation. However, the reliability and quality of code generated by ChatGPT remain unexplored, raising concerns about potential risks associated with the widespread use of ChatGPT-driven code generation.

In this paper, we systematically study the quality of 4,066 ChatGPT-generated programs implemented in two popular programming languages, i.e., Java and Python, for 2,033 programming tasks. The goal of this work is threefold. First, we analyze the correctness of ChatGPT on code generation tasks and uncover the factors that influence its effectiveness, including task difficulty, programming language, the time at which tasks were introduced, and program size. Second, we identify and characterize potential issues with the quality of ChatGPT-generated code. Last, we provide insights into how these issues can be mitigated. Experiments highlight that out of 4,066 programs generated by ChatGPT, 2,756 programs are deemed correct, 1,082 programs provide wrong outputs, and 177 programs contain compilation or runtime errors. Additionally, we further analyze other characteristics of the generated code through static analysis tools, such as code style and maintainability, and find that 1,930 ChatGPT-generated code snippets suffer from maintainability issues. Subsequently, we investigate ChatGPT’s self-repairing ability and its interaction with static analysis tools to fix the errors uncovered in the previous step. Experiments suggest that ChatGPT can partially address these challenges, improving code quality by more than 20%, but there are still limitations and opportunities for improvement. Overall, our study provides valuable insights into the current limitations of ChatGPT and offers a roadmap for future research and development efforts to enhance the code generation capabilities of AI models like ChatGPT.

Citations: 0
Test Input Prioritization for 3D Point Clouds
IF 4.4 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING · Pub Date: 2024-01-27 · DOI: 10.1145/3643676
Yinghua Li, Xueqi Dang, Lei Ma, Jacques Klein, Yves LE Traon, Tegawendé F. Bissyandé

Three-dimensional (3D) point cloud applications have become increasingly prevalent in diverse domains, showcasing their efficacy in various software systems. However, testing such applications presents unique challenges due to the high-dimensional nature of 3D point cloud data and the vast number of possible test cases. Test input prioritization has emerged as a promising approach to enhance testing efficiency by prioritizing potentially misclassified test cases during the early stages of the testing process. Consequently, this enables the early labeling of critical inputs, leading to a reduction in the overall labeling cost. However, applying existing prioritization methods to 3D point cloud data is constrained by several factors: 1) inadequate consideration of crucial spatial information, and 2) susceptibility to the noise inherent in 3D point cloud data. In this paper, we propose PCPrior, the first test prioritization approach specifically designed for 3D point cloud test cases. The fundamental concept behind PCPrior is that test inputs closer to the decision boundary of the model are more likely to be predicted incorrectly. To capture the spatial relationship between a point cloud test and the decision boundary, we propose transforming each test (a point cloud) into a low-dimensional feature vector, indirectly revealing the underlying proximity between a test and the decision boundary. To achieve this, we carefully design a group of feature generation strategies, and for each test input, we generate four distinct types of features, namely spatial features, mutation features, prediction features, and uncertainty features. Through a concatenation of the four feature types, PCPrior assembles a final feature vector for each test. Subsequently, a ranking model is employed to estimate the probability of misclassification for each test based on its feature vector. Finally, PCPrior ranks all tests based on their misclassification probabilities. We conducted an extensive study based on 165 subjects to evaluate the performance of PCPrior, encompassing both natural and noisy datasets. The results demonstrate that PCPrior outperforms all the compared test prioritization approaches, with an average improvement of 10.99%~66.94% on natural datasets and 16.62%~53% on noisy datasets.
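The assemble-and-rank step described above can be sketched as follows; this is a hedged simplification in which the learned ranking model is replaced by a caller-supplied scoring stub.

```python
def assemble(spatial, mutation, prediction, uncertainty):
    """Concatenate the four per-test feature groups into one vector."""
    return list(spatial) + list(mutation) + list(prediction) + list(uncertainty)

def prioritize(tests, features, score_fn):
    """Order tests by descending predicted misclassification probability,
    where score_fn stands in for PCPrior's trained ranking model."""
    scored = [(score_fn(assemble(*features[t])), t) for t in tests]
    return [t for _, t in sorted(scored, reverse=True)]
```

Tests that score highest are labeled first, which is how the approach cuts labeling cost when the budget only covers part of the test suite.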

Citations: 0
Test Optimization in DNN Testing: A Survey
IF 4.4 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING · Pub Date: 2024-01-27 · DOI: 10.1145/3643678
Qiang Hu, Yuejun Guo, Xiaofei Xie, Maxime Cordy, Lei Ma, Mike Papadakis, Yves Le Traon

This paper presents a comprehensive survey on test optimization in deep neural network (DNN) testing. Here, test optimization refers to testing with low data labeling effort. We analyzed 90 papers, including 43 from the software engineering (SE) community, 32 from the machine learning (ML) community, and 15 from other communities. Our study: (i) unifies the problems as well as terminologies associated with low-labeling cost testing, (ii) compares the distinct focal points of SE and ML communities, and (iii) reveals the pitfalls in existing literature. Furthermore, we highlight the research opportunities in this domain.

Understanding Real-time Collaborative Programming: a Study of Visual Studio Live Share 了解实时协作编程:Visual Studio Live Share 研究
IF 4.4 2区 计算机科学 Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING Pub Date : 2024-01-27 DOI: 10.1145/3643672
Xin Tan, Xinyue Lv, Jing Jiang, Li Zhang

Real-time collaborative programming (RCP) entails developers working simultaneously, regardless of their geographic locations. RCP differs from traditional asynchronous online programming methods, such as Git or SVN, where developers work independently and update the codebase at separate times. Although various real-time code collaboration tools (e.g., Visual Studio Live Share, Code with Me, and Replit) have kept emerging in recent years, none of the existing studies explicitly focuses on a deep understanding of the processes or experiences associated with RCP. To this end, we combine interviews and an email survey with the users of Visual Studio Live Share, aiming to understand (i) the scenarios, (ii) the requirements, and (iii) the challenges when developers participate in RCP. We find that developers participate in RCP in 18 different scenarios belonging to six categories, e.g., pair programming, group debugging, and code review. However, existing users’ attitudes toward the usefulness of the current RCP tools in these scenarios were significantly more negative than the expectations of potential users. As for the requirements, the most critical category is live editing, followed by the need for sharing terminals to enable hosts and guests to run commands and see the results, as well as focusing and following, which involves “following” the host’s edit location and “focusing” the guests’ attention on the host with a notification. Under these categories, we identify 17 requirements, but most of them are not well supported by current tools. In terms of challenges, we identify 19 challenges belonging to seven categories. The most severe category of challenges is lagging, followed by permissions and conflicts. The above findings indicate that the current RCP tools and even the collaborative environment need to be improved greatly and urgently.
Based on these findings, we discuss the recommendations for different stakeholders, including practitioners, tool designers, and researchers.

Mitigating Debugger-based Attacks to Java Applications with Self-Debugging 利用自调试缓解基于调试器的 Java 应用程序攻击
IF 4.4 2区 计算机科学 Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING Pub Date : 2024-01-25 DOI: 10.1145/3631971
Davide Pizzolotto, Stefano Berlato, Mariano Ceccato

Java bytecode is a quite high-level language and, as such, it is fairly easy to analyze and decompile with malicious intent, e.g., to tamper with code and skip license checks. Code obfuscation was a first attempt to mitigate malicious reverse engineering based on static analysis. However, obfuscated code can still be dynamically analyzed with standard debuggers to perform step-wise execution and to inspect (or change) memory content at important execution points, e.g., to alter the verdict of license validity checks. Although some approaches have been proposed to mitigate debugger-based attacks, they are only applicable to binary compiled code, and none addresses the challenge of protecting Java bytecode.

In this paper, we propose a novel approach to protect Java bytecode from malicious debugging. Our approach is based on automated program transformation: it manipulates Java bytecode and splits it into two binary processes that debug each other (i.e., a self-debugging solution). In fact, when the debugging interface is already engaged, an additional malicious debugger cannot attach. To be resilient against typical attacks, our approach adopts a series of technical solutions: the two processes share an encoded channel to avoid leaking information, an authentication protocol prevents Man-in-the-Middle attacks, and the computation is spread between the two processes to prevent the attacker from replacing or terminating either of them.

We test our solution on 18 real-world Java applications, showing that our approach can effectively block the most common debugging tasks (either with the Java debugger or the GNU debugger) while preserving the functional correctness of the protected programs. While the final decision on when to activate this protection is still up to the developers, the observed performance overhead was acceptable for common desktop application domains.
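The mutual self-debugging protocol of the paper is elaborate; as a minimal related illustration, a Java process can at least observe whether its own debugging interface (JDWP) is engaged by inspecting the JVM's input arguments. This is a much weaker check than the paper's two-process scheme and is shown only to make the "engaged debugging interface" idea concrete; the class and method names are our own:

```java
import java.lang.management.ManagementFactory;

public class JdwpCheck {
    // Returns true when this JVM was started with the JDWP agent enabled,
    // i.e., a Java debugger can attach (or is attached) to this process.
    static boolean jdwpEnabled() {
        for (String arg : ManagementFactory.getRuntimeMXBean().getInputArguments()) {
            if (arg.contains("-agentlib:jdwp") || arg.contains("-Xrunjdwp")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(jdwpEnabled()
                ? "debugging interface engaged"
                : "no debugger agent");
    }
}
```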

Exploring Semantic Redundancy using Backdoor Triggers: A Complementary Insight into the Challenges facing DNN-based Software Vulnerability Detection 利用后门触发器探索语义冗余:基于 DNN 的软件漏洞检测所面临挑战的补充见解
IF 4.4 2区 计算机科学 Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING Pub Date : 2024-01-24 DOI: 10.1145/3640333
Changjie Shao, Gaolei Li, Jun Wu, Xi Zheng

To detect software vulnerabilities with better performance, deep neural networks (DNNs) have received extensive attention recently. However, these vulnerability detection DNN models trained with code representations are vulnerable to specific perturbations on code representations. This motivates us to rethink the bane of software vulnerability detection and to find function-agnostic features in code representations, which we name semantic redundant features. This paper first identifies a tight correlation between function-agnostic triggers and the semantic redundant feature space (where the redundant features reside) in these DNN models. For correlation identification, we propose a novel Backdoor-based Semantic Redundancy Exploration (BSemRE) framework. In BSemRE, the sensitivity of the trained models to function-agnostic triggers is observed to verify the existence of semantic redundancy in various code representations. Specifically, as typical manifestations of semantic redundancy, naming conventions, ternary operators, and identically-true conditions are exploited to generate function-agnostic triggers. Extensive comparative experiments on 1,613,823 samples from eight representative vulnerability datasets, state-of-the-art code representation techniques, and vulnerability detection models demonstrate that the existence of semantic redundancy determines the upper trustworthiness limit of DNN-based software vulnerability detection. To the best of our knowledge, this is the first work exploring the bane of software vulnerability detection using backdoor triggers.
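Two of the trigger families named above, identically-true conditions and ternary operators, are easy to illustrate as semantics-preserving source transformations. The string-level rewriting below is our own sketch, not the BSemRE trigger generator:

```java
public class TriggerSketch {
    // Wraps a statement in an identically-true guard: the guarded code
    // behaves exactly like the original, but the token pattern can act as
    // a function-agnostic trigger for a DNN-based detector.
    static String wrapIdenticallyTrue(String stmt) {
        return "if (1 == 1) { " + stmt + " }";
    }

    // Rewrites a plain boolean assignment into an equivalent ternary form,
    // another of the semantics-preserving patterns listed in the abstract.
    static String toTernary(String lhs, String cond) {
        return lhs + " = (" + cond + ") ? true : false;";
    }

    public static void main(String[] args) {
        System.out.println(wrapIdenticallyTrue("count++;")); // if (1 == 1) { count++; }
        System.out.println(toTernary("flag", "x > 0"));      // flag = (x > 0) ? true : false;
    }
}
```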
