
Latest publications: IEEE Transactions on Software Engineering

StagedVulBERT: Multigranular Vulnerability Detection With a Novel Pretrained Code Model
IF 6.5 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING | Pub Date: 2024-11-07 | DOI: 10.1109/TSE.2024.3493245
Yuan Jiang;Yujian Zhang;Xiaohong Su;Christoph Treude;Tiantian Wang
The emergence of pre-trained model-based vulnerability detection methods has significantly advanced the field of automated vulnerability detection. However, these methods still face several challenges, such as difficulty in learning effective feature representations of statements for fine-grained predictions and difficulty in processing overly long code sequences. To address these issues, this study introduces StagedVulBERT, a novel vulnerability detection framework that leverages a pre-trained code language model and employs a coarse-to-fine strategy. The key innovation and contribution of our research lies in the development of the CodeBERT-HLS component within our framework, specialized in hierarchical, layered, and semantic encoding. This component is designed to capture semantics at both the token and statement levels simultaneously, which is crucial for achieving more accurate multi-granular vulnerability detection. Additionally, CodeBERT-HLS efficiently processes longer code token sequences, making it more suited to real-world vulnerability detection. Comprehensive experiments demonstrate that our method enhances the performance of vulnerability detection at both coarse- and fine-grained levels. Specifically, in coarse-grained vulnerability detection, StagedVulBERT achieves an F1 score of 92.26%, marking a 6.58% improvement over the best-performing methods. At the fine-grained level, our method achieves a Top-5% accuracy of 65.69%, which outperforms the state-of-the-art methods by up to 75.17%.
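The coarse-to-fine idea — per-token embeddings pooled into per-statement representations so one model can score both a whole function and individual statements — can be illustrated with a minimal sketch. All names, the mean-pooling choice, and the toy linear scorer are assumptions for illustration, not the CodeBERT-HLS implementation:

```python
# Hypothetical sketch: pool per-token vectors into per-statement vectors,
# then score at both granularities (function-level and statement-level).

def pool_statements(token_vecs, stmt_ids):
    """Mean-pool token embeddings that share the same statement id."""
    groups = {}
    for vec, sid in zip(token_vecs, stmt_ids):
        groups.setdefault(sid, []).append(vec)
    return {
        sid: [sum(dim) / len(vecs) for dim in zip(*vecs)]
        for sid, vecs in groups.items()
    }

def score(vec, weights):
    """Toy linear scorer standing in for a learned classification head."""
    return sum(v * w for v, w in zip(vec, weights))

# Two statements, four tokens, 2-dimensional embeddings.
tokens = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
stmt_of_token = [0, 0, 1, 1]

stmt_vecs = pool_statements(tokens, stmt_of_token)
func_vec = [sum(d) / len(tokens) for d in zip(*tokens)]  # coarse (function) view

coarse = score(func_vec, [1.0, 1.0])                            # function-level
fine = {s: score(v, [1.0, 1.0]) for s, v in stmt_vecs.items()}  # statement-level
```

The coarse score flags a suspicious function; the per-statement scores then rank which statements inside it are most likely vulnerable.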
IEEE Transactions on Software Engineering, vol. 50, no. 12, pp. 3454–3471.
Citations: 0
SMARLA: A Safety Monitoring Approach for Deep Reinforcement Learning Agents
IF 6.5 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING | Pub Date: 2024-11-06 | DOI: 10.1109/TSE.2024.3491496
Amirhossein Zolfagharian;Manel Abdellatif;Lionel C. Briand;Ramesh S
Deep Reinforcement Learning (DRL) has made significant advancements in various fields, such as autonomous driving, healthcare, and robotics, by enabling agents to learn optimal policies through interactions with their environments. However, the application of DRL in safety-critical domains presents challenges, particularly concerning the safety of the learned policies. DRL agents, which are focused on maximizing rewards, may select unsafe actions, leading to safety violations. Runtime safety monitoring is thus essential to ensure the safe operation of these agents, especially in unpredictable and dynamic environments. This paper introduces SMARLA, a black-box safety monitoring approach specifically designed for DRL agents. SMARLA utilizes machine learning to predict safety violations by observing the agent's behavior during execution. The approach is based on Q-values, which reflect the expected reward for taking actions in specific states. SMARLA employs state abstraction to reduce the complexity of the state space, enhancing the predictive capabilities of the monitoring model. Such abstraction enables the early detection of unsafe states, allowing for the implementation of corrective and preventive measures before incidents occur. We quantitatively and qualitatively validated SMARLA on three well-known case studies widely used in DRL research. Empirical results reveal that SMARLA is accurate at predicting safety violations, with a low false positive rate, and can predict violations at an early stage, approximately halfway through the execution of the agent, before violations occur. We also discuss different decision criteria, based on confidence intervals of the predicted violation probabilities, to trigger safety mechanisms aiming at a trade-off between early detection and low false positive rates.
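The core mechanism — abstracting concrete states via their Q-values and predicting violation risk from behavior observed during training episodes — can be sketched minimally. The bucketing scheme, the frequency-count "model," and the alarm threshold are illustrative assumptions, not SMARLA's actual machine-learning classifier:

```python
# Hypothetical sketch of Q-value-based safety monitoring with state abstraction.

def abstract_state(q_values, bucket=1.0):
    """Collapse a concrete state to (greedy action, coarse max-Q bucket)."""
    best = max(range(len(q_values)), key=lambda a: q_values[a])
    return (best, int(q_values[best] // bucket))

class ViolationMonitor:
    def __init__(self):
        self.seen = {}  # abstract state -> (episodes observed, violations)

    def record(self, q_values, violated):
        s = abstract_state(q_values)
        n, v = self.seen.get(s, (0, 0))
        self.seen[s] = (n + 1, v + int(violated))

    def risk(self, q_values):
        """Estimated violation probability for the current abstract state."""
        n, v = self.seen.get(abstract_state(q_values), (0, 0))
        return v / n if n else 0.0

monitor = ViolationMonitor()
# Offline: label abstract states from episodes that did / did not end in a violation.
monitor.record([0.2, 1.7], violated=True)
monitor.record([0.1, 1.9], violated=True)
monitor.record([2.3, 0.4], violated=False)

# Online: raise an alarm mid-episode once estimated risk crosses a threshold.
alarm = monitor.risk([0.3, 1.8]) > 0.5
```

A confidence interval on the estimated probability (rather than the raw ratio) would give the early-detection vs. false-positive trade-off the paper discusses.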
IEEE Transactions on Software Engineering, vol. 51, no. 1, pp. 82–105.
Citations: 0
Diversity-Oriented Testing for Competitive Game Agent via Constraint-Guided Adversarial Agent Training
IF 6.5 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING | Pub Date: 2024-11-05 | DOI: 10.1109/TSE.2024.3491193
Xuyan Ma;Yawen Wang;Junjie Wang;Xiaofei Xie;Boyu Wu;Yiguang Yan;Shoubin Li;Fanjiang Xu;Qing Wang
Deep reinforcement learning has achieved remarkable success in competitive games, surpassing human performance in applications ranging from business competitions to video games. In competitive environments, agents face the challenge of adapting to continuously shifting adversary strategies, necessitating the ability to handle diverse scenarios. Existing studies primarily focus on evaluating agent robustness either through perturbing observations, which has practical limitations, or through training adversarial agents to expose weaknesses, which lacks strategy diversity exploration. Other studies rely on curiosity-based mechanisms to explore diversity, yet they may lack direct guidance for addressing identified decision-making flaws. In this paper, we propose a novel diversity-oriented testing framework (called AdvTest) to test the competitive game agent via constraint-guided adversarial agent training. Specifically, AdvTest adds constraints as explicit guidance during adversarial agent training to make it capable of defeating the target agent using diverse strategies. To realize the method, three challenges need to be addressed: what the suitable constraints are, when to introduce them, and which constraint should be added. We experimentally evaluate AdvTest on the commonly-used competitive game environment, StarCraft II. The results on four maps show that AdvTest exposes more diverse failure scenarios compared with the commonly-used and state-of-the-art baselines.
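Constraint-guided training can be sketched as reward shaping: the adversarial agent earns its competitive reward plus a bonus for winning in a way that satisfies a diversity constraint. The constraint function, weight, and action names below are illustrative assumptions, not AdvTest's actual constraints:

```python
# Hypothetical sketch of constraint-guided adversarial training via reward shaping.

def shaped_reward(win, trajectory, constraint, weight=0.5):
    """Base competitive reward plus a constraint-satisfaction bonus."""
    base = 1.0 if win else -1.0
    bonus = weight if constraint(trajectory) else 0.0
    return base + bonus

# Example diversity constraint: a winning strategy must use >= 3 distinct actions.
def uses_diverse_actions(traj):
    return len(set(traj)) >= 3

r_narrow = shaped_reward(True, ["rush", "rush", "rush"], uses_diverse_actions)
r_diverse = shaped_reward(True, ["rush", "harass", "expand"], uses_diverse_actions)
```

With the bonus, two wins are no longer equal: the diverse win is preferred, steering the adversary toward strategies the target agent has not yet been tested against.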
IEEE Transactions on Software Engineering, vol. 51, no. 1, pp. 66–81.
Citations: 0
Dividable Configuration Performance Learning
IF 6.5 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING | Pub Date: 2024-11-05 | DOI: 10.1109/TSE.2024.3491945
Jingzhi Gong;Tao Chen;Rami Bahsoon
Machine/deep learning models have been widely adopted to predict the configuration performance of software systems. However, a crucial yet unaddressed challenge is how to cater for the sparsity inherited from the configuration landscape: the influence of configuration options (features) and the distribution of data samples are highly sparse. In this paper, we propose a model-agnostic and sparsity-robust framework for predicting configuration performance, dubbed DaL, based on the new paradigm of dividable learning that builds a model via “divide-and-learn”. To handle sample sparsity, the samples from the configuration landscape are divided into distinct divisions, for each of which we build a sparse local model, e.g., a regularized Hierarchical Interaction Neural Network, to deal with the feature sparsity. A newly given configuration would then be assigned to the right division's model for the final prediction. Further, DaL adaptively determines the optimal number of divisions required for a system and sample size without any extra training or profiling. Experiment results from 12 real-world systems and five sets of training data reveal that, compared with the state-of-the-art approaches, DaL performs no worse than the best counterpart on 44 out of 60 cases (within which 31 cases are significantly better) with up to 1.61× improvement on accuracy; requires fewer samples to reach the same/better accuracy; and produces acceptable training overhead. In particular, the mechanism that adapts the parameter $d$ reaches the optimal value in 76.43% of the individual runs. The result also confirms that the paradigm of dividable learning is more suitable than other similar paradigms such as ensemble learning for predicting configuration performance. Practically, DaL considerably improves different global models when using them as the underlying local models, which further strengthens its flexibility. 
To promote open science, all the data, code, and supplementary materials of this work can be accessed at our repository: https://github.com/ideas-labo/DaL-ext.
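The "divide-and-learn" routing idea — partition the samples into divisions, fit one local model per division, and send a new configuration to its division's model — can be sketched in a few lines. The 1-D configurations, nearest-centroid division rule, and mean-value local models are illustrative assumptions, not DaL's regularized hierarchical interaction networks:

```python
# Hypothetical sketch of divide-and-learn for configuration performance prediction.

def divide(samples, centroids):
    """Assign each (config, perf) sample to its nearest centroid's division."""
    divisions = [[] for _ in centroids]
    for x, y in samples:
        d = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
        divisions[d].append((x, y))
    return divisions

def fit_local(division):
    """Toy local model: predict the division's mean performance."""
    mean = sum(y for _, y in division) / len(division)
    return lambda x: mean

centroids = [1.0, 10.0]
samples = [(0.5, 3.0), (1.5, 5.0), (9.0, 40.0), (11.0, 60.0)]
models = [fit_local(d) for d in divide(samples, centroids)]

def predict(x):
    """Route a new configuration to its division, then query the local model."""
    d = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
    return models[d](x)
```

A single global mean here would predict 27.0 everywhere; routing to local models captures the two sparse regions of the landscape separately.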
IEEE Transactions on Software Engineering, vol. 51, no. 1, pp. 106–134. Open access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10744216
Citations: 0
Fight Fire With Fire: How Much Can We Trust ChatGPT on Source Code-Related Tasks?
IF 6.5 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING | Pub Date: 2024-11-05 | DOI: 10.1109/TSE.2024.3492204
Xiao Yu;Lei Liu;Xing Hu;Jacky Wai Keung;Jin Liu;Xin Xia
With the increasing utilization of large language models such as ChatGPT during software development, it has become crucial to verify the quality of code content it generates. Recent studies proposed utilizing ChatGPT as both a developer and tester for multi-agent collaborative software development. The multi-agent collaboration empowers ChatGPT to produce test reports for its generated code, enabling it to self-verify the code content and fix bugs based on these reports. However, these studies did not assess the effectiveness of the generated test reports in validating the code. Therefore, we conduct a comprehensive empirical investigation to evaluate ChatGPT's self-verification capability in code generation, code completion, and program repair. We request ChatGPT to (1) generate correct code and then self-verify its correctness; (2) complete code without vulnerabilities and then self-verify for the presence of vulnerabilities; and (3) repair buggy code and then self-verify whether the bugs are resolved. Our findings on two code generation datasets, one code completion dataset, and two program repair datasets reveal the following observations: (1) ChatGPT often erroneously predicts its generated incorrect code as correct, its vulnerable completed code as non-vulnerable, and its failed program repairs as successful during its self-verification. (2) Self-contradictory hallucinations arise in ChatGPT's behavior: (a) ChatGPT initially generates code that it believes to be correct but later predicts it to be incorrect; (b) ChatGPT initially generates code completions that it deems secure but later predicts them to be vulnerable; (c) ChatGPT initially outputs code that it considers successfully repaired but later predicts it to be buggy during its self-verification. 
(3) The self-verification capability of ChatGPT can be enhanced by asking the guiding question, which queries whether ChatGPT agrees with assertions about incorrectly generated or repaired code and vulnerabilities in completed code. (4) Using test reports generated by ChatGPT can identify more vulnerabilities in completed code, but the explanations for incorrectly generated code and failed repairs are mostly inaccurate in the test reports. Based on these findings, we provide implications for further research or development using ChatGPT.
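The "guiding question" in finding (3) replaces an open-ended "is this code correct?" with an explicit assertion for the model to agree or disagree with. A minimal sketch of such a prompt builder follows; the template wording and task names are assumptions, not the exact prompts used in the study:

```python
# Hypothetical sketch: build a guiding question that asserts a defect,
# rather than asking an open-ended self-verification question.

def guiding_question(task, code):
    assertions = {
        "generation": "The following generated code is incorrect.",
        "completion": "The following completed code contains a vulnerability.",
        "repair": "The following repair does not fix the bug.",
    }
    return (
        f"{assertions[task]} Do you agree? Answer yes or no, then explain.\n\n"
        f"```\n{code}\n```"
    )

prompt = guiding_question("completion", "strcpy(buf, user_input);")
```

Framing the check as agreement with a concrete assertion is what the paper found improves ChatGPT's ability to admit flaws in its own output.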
IEEE Transactions on Software Engineering, vol. 50, no. 12, pp. 3435–3453.
Citations: 0
AIM: Automated Input Set Minimization for Metamorphic Security Testing
IF 6.5 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING | Pub Date: 2024-10-30 | DOI: 10.1109/TSE.2024.3488525
Nazanin Bayati Chaleshtari;Yoann Marquer;Fabrizio Pastore;Lionel C. Briand
Although the security testing of Web systems can be automated by generating crafted inputs, solutions to automate the test oracle, i.e., vulnerability detection, remain difficult to apply in practice. Specifically, though previous work has demonstrated the potential of metamorphic testing—security failures can be determined by metamorphic relations that turn valid inputs into malicious inputs—metamorphic relations are typically executed on a large set of inputs, which is time-consuming and thus makes metamorphic testing impractical. We propose AIM, an approach that automatically selects inputs to reduce testing costs while preserving vulnerability detection capabilities. AIM includes a clustering-based black-box approach, to identify similar inputs based on their security properties. It also relies on a novel genetic algorithm to efficiently select diverse inputs while minimizing their total cost. Further, it contains a problem-reduction component to reduce the search space and speed up the minimization process. We evaluated the effectiveness of AIM on two well-known Web systems, Jenkins and Joomla, with documented vulnerabilities. We compared AIM's results with four baselines involving standard search approaches. Overall, AIM reduced metamorphic testing time by 84% for Jenkins and 82% for Joomla, while preserving the same level of vulnerability detection. Furthermore, AIM significantly outperformed all the considered baselines regarding vulnerability coverage.
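The minimization objective — keep a cheap subset of inputs that still covers every security property observed in the full set — can be illustrated with a greedy sketch. AIM's clustering and genetic search are replaced here by an illustrative greedy set cover over precomputed property labels; all input names, costs, and property tags are assumptions:

```python
# Hypothetical sketch of input-set minimization as a greedy weighted set cover.

def minimize(inputs):
    """inputs: list of (name, cost, frozenset of security properties covered)."""
    needed = set().union(*(props for _, _, props in inputs))
    chosen, covered = [], set()
    # Consider cheapest inputs first; keep one only if it covers something new.
    for name, cost, props in sorted(inputs, key=lambda t: t[1]):
        if props - covered:
            chosen.append(name)
            covered |= props
        if covered == needed:
            break
    return chosen

inputs = [
    ("a", 1, frozenset({"sqli"})),
    ("b", 2, frozenset({"sqli", "xss"})),
    ("c", 3, frozenset({"xss"})),
    ("d", 5, frozenset({"path_traversal"})),
]
kept = minimize(inputs)  # metamorphic relations then run only on `kept`
```

Input "c" is dropped because "b" already covers its property, so the metamorphic relations run on three inputs instead of four while every property class stays represented.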
IEEE Transactions on Software Engineering, vol. 50, no. 12, pp. 3403–3434.
Citations: 0
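As an editorial aside, the cost-aware input selection that AIM's abstract describes can be illustrated with a small sketch. Everything below — the tuple layout, the greedy weighted set-cover strategy, the function name — is an assumption for illustration; AIM itself combines clustering with a genetic algorithm and a problem-reduction step, not this greedy loop, but the sketch captures the core objective: cover every security-property cluster at minimal total cost.

```python
# Illustrative sketch (not AIM's implementation): pick a cheap subset of test
# inputs that still covers every security-property cluster. AIM itself uses
# clustering plus a genetic algorithm; the greedy weighted set cover below
# only demonstrates the optimization objective.

def minimize_inputs(inputs):
    """inputs: list of (name, cost, cluster_ids); returns selected names."""
    uncovered = set()
    for _, _, clusters in inputs:
        uncovered |= set(clusters)
    selected = []
    while uncovered:
        # Take the input with the best (newly covered clusters) / cost ratio.
        name, cost, clusters = max(
            inputs, key=lambda i: len(set(i[2]) & uncovered) / i[1]
        )
        gain = set(clusters) & uncovered
        if not gain:  # nothing left to gain; avoid an infinite loop
            break
        selected.append(name)
        uncovered -= gain
    return selected
```

For example, given three inputs with costs 1, 5, 1 covering property clusters {1, 2}, {1, 2, 3}, and {3}, the loop keeps the two unit-cost inputs that jointly cover all clusters and drops the expensive redundant one.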
A Comprehensive Study on Static Application Security Testing (SAST) Tools for Android
IF 6.5 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING Pub Date : 2024-10-30 DOI: 10.1109/TSE.2024.3488041
Jingyun Zhu;Kaixuan Li;Sen Chen;Lingling Fan;Junjie Wang;Xiaofei Xie
To identify security vulnerabilities in Android applications, numerous static application security testing (SAST) tools have been proposed. However, assessing their overall performance on diverse vulnerability types is a non-trivial task that poses considerable challenges. Firstly, the absence of a unified evaluation platform for defining and describing the vulnerability types each tool supports, coupled with the lack of normalization for the intricate and varied reports generated by different tools, significantly adds to the complexity. Secondly, there is a scarcity of adequate benchmarks, particularly those derived from real-world scenarios. To address these problems, we are the first to propose a unified platform named VulsTotal, which supports various vulnerability types and enables comprehensive and versatile analysis across diverse SAST tools. Specifically, we begin by meticulously selecting 11 free and open-source SAST tools from a pool of 97 existing options, adhering to clearly defined criteria. After that, we invest significant effort in comprehending the detection rules of each tool, subsequently unifying 67 general/common vulnerability types for Android SAST tools. We also redefine and implement a standardized reporting format, ensuring uniformity in presenting results across all tools. Additionally, to mitigate the benchmark problem, we conducted a manual analysis of a large number of CVEs to construct a new CVE-based benchmark grounded in our comprehension of Android app vulnerabilities. Leveraging the evaluation platform, which integrates both existing synthetic benchmarks and the newly constructed CVE-based benchmark from this study, we conducted a comprehensive analysis to evaluate and compare the selected tools from various perspectives, such as general vulnerability type coverage, type consistency, tool effectiveness, and time performance. Our observations yielded notable findings, such as the technical reasons underlying the observed performance, which provide insights for different stakeholders.
{"title":"A Comprehensive Study on Static Application Security Testing (SAST) Tools for Android","authors":"Jingyun Zhu;Kaixuan Li;Sen Chen;Lingling Fan;Junjie Wang;Xiaofei Xie","doi":"10.1109/TSE.2024.3488041","DOIUrl":"10.1109/TSE.2024.3488041","url":null,"abstract":"To identify security vulnerabilities in Android applications, numerous static application security testing (SAST) tools have been proposed. However, assessing their overall performance on diverse vulnerability types is a non-trivial task that poses considerable challenges. Firstly, the absence of a unified evaluation platform for defining and describing tools’ supported vulnerability types, coupled with the lack of normalization for the intricate and varied reports generated by different tools, significantly adds to the complexity. Secondly, there is a scarcity of adequate benchmarks, particularly those derived from real-world scenarios. To address these problems, we are the first to propose a unified platform named \u0000<italic>VulsTotal</i>\u0000, supporting various vulnerability types, enabling comprehensive and versatile analysis across diverse SAST tools. Specifically, we begin by meticulously selecting 11 free and open-sourced SAST tools from a pool of 97 existing options, adhering to clearly defined criteria. After that, we invest significant efforts in comprehending the detection rules of each tool, subsequently unifying 67 general/common vulnerability types for Android SAST tools. We also redefine and implement a standardized reporting format, ensuring uniformity in presenting results across all tools. Additionally, to mitigate the problem of benchmarks, we conducted a manual analysis of huge amounts of CVEs to construct a new CVE-based benchmark based on our comprehension of Android app vulnerabilities. 
Leveraging the evaluation platform, which integrates both existing synthetic benchmarks and newly constructed CVE-based benchmarks from this study, we conducted a comprehensive analysis to evaluate and compare these selected tools from various perspectives, such as general vulnerability type coverage, type consistency, tool effectiveness, and time performance. Our observations yielded impressive findings, like the technical reasons underlying the performance, which provide insights for different stakeholders.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"50 12","pages":"3385-3402"},"PeriodicalIF":6.5,"publicationDate":"2024-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142555915","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
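The report-normalization step that the VulsTotal abstract describes can be sketched as follows. The schema fields, the rule-ID mapping table, and the tool names are hypothetical — the abstract does not publish VulsTotal's actual format — but the sketch shows how findings from different tools become directly comparable once mapped to unified vulnerability types.

```python
# Hypothetical sketch of a unified reporting format: tool names, rule IDs,
# schema fields, and the mapping table are assumptions, not VulsTotal's API.

from dataclasses import dataclass

# Maps (tool, tool-specific rule ID) to a unified vulnerability type.
TYPE_MAP = {
    ("ToolA", "WebViewJS"): "CWE-749: Exposed Dangerous Method",
    ("ToolB", "insecure-webview"): "CWE-749: Exposed Dangerous Method",
}

@dataclass(frozen=True)
class Finding:
    tool: str
    unified_type: str
    file: str
    line: int

def normalize(tool: str, raw: dict) -> Finding:
    """Convert one tool-specific report entry into the unified schema."""
    unified = TYPE_MAP.get((tool, raw["rule"]), "UNMAPPED")
    return Finding(tool, unified, raw["path"], raw["line"])
```

With such a mapping, two tools that flag the same insecure WebView configuration under different rule names yield findings with the same unified type, which is what makes type-coverage and consistency comparisons across tools meaningful.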
Gotcha! This Model Uses My Code! Evaluating Membership Leakage Risks in Code Models
IF 6.5 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING Pub Date : 2024-10-25 DOI: 10.1109/TSE.2024.3482719
Zhou Yang;Zhipeng Zhao;Chenyu Wang;Jieke Shi;Dongsun Kim;DongGyun Han;David Lo
Leveraging large-scale datasets from open-source projects and advances in large language models, recent progress has led to sophisticated code models for key software engineering tasks, such as program repair and code completion. These models are trained on data from various sources, including public open-source projects like GitHub and private, confidential code from companies, raising significant privacy concerns. This paper investigates a crucial but unexplored question: What is the risk of membership information leakage in code models? Membership leakage refers to the vulnerability where an attacker can infer whether a specific data point was part of the training dataset. We present Gotcha, a novel membership inference attack method designed for code models, and evaluate its effectiveness on Java-based datasets. Gotcha simultaneously considers three key factors: model input, model output, and ground truth. Our ablation study confirms that each factor significantly enhances attack performance. Our investigation reveals a troubling finding: membership leakage risk is significantly elevated. While previous methods achieved accuracy close to random guessing, Gotcha achieves high precision, with a true positive rate of 0.95 and a low false positive rate of 0.10. We also demonstrate that the attacker's knowledge of the victim model (e.g., model architecture and pre-training data) affects attack success. Additionally, modifying decoding strategies can help reduce membership leakage risks. This research highlights the urgent need to better understand the privacy vulnerabilities of code models and to develop strong countermeasures against these threats.
{"title":"Gotcha! This Model Uses My Code! Evaluating Membership Leakage Risks in Code Models","authors":"Zhou Yang;Zhipeng Zhao;Chenyu Wang;Jieke Shi;Dongsun Kim;DongGyun Han;David Lo","doi":"10.1109/TSE.2024.3482719","DOIUrl":"10.1109/TSE.2024.3482719","url":null,"abstract":"Leveraging large-scale datasets from open-source projects and advances in large language models, recent progress has led to sophisticated code models for key software engineering tasks, such as program repair and code completion. These models are trained on data from various sources, including public open-source projects like GitHub and private, confidential code from companies, raising significant privacy concerns. This paper investigates a crucial but unexplored question: \u0000<italic>What is the risk of membership information leakage in code models?</i>\u0000 Membership leakage refers to the vulnerability where an attacker can infer whether a specific data point was part of the training dataset. We present \u0000<sc>Gotcha</small>\u0000, a novel membership inference attack method designed for code models, and evaluate its effectiveness on Java-based datasets. \u0000<sc>Gotcha</small>\u0000 simultaneously considers three key factors: model input, model output, and ground truth. Our ablation study confirms that each factor significantly enhances attack performance. Our investigation reveals a troubling finding: \u0000<bold>membership leakage risk is significantly elevated</b>\u0000. While previous methods had accuracy close to random guessing, \u0000<sc>Gotcha</small>\u0000 achieves high precision, with a true positive rate of 0.95 and a low false positive rate of 0.10. We also demonstrate that the attacker's knowledge of the victim model (e.g., model architecture and pre-training data) affects attack success. Additionally, modifying decoding strategies can help reduce membership leakage risks. 
This research highlights the urgent need to better understand the privacy vulnerabilities of code models and develop strong countermeasures against these threats.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"50 12","pages":"3290-3306"},"PeriodicalIF":6.5,"publicationDate":"2024-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142490464","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
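The intuition behind output-based membership inference can be shown with a toy score: a model that reproduces the reference completion nearly verbatim is more likely to have seen it during training. The similarity metric and threshold below are illustrative assumptions only; Gotcha's actual classifier jointly learns from model input, model output, and ground truth rather than thresholding a single ratio.

```python
# Toy illustration of the output-vs-ground-truth signal behind membership
# inference. The metric and threshold are assumptions, not Gotcha's method.

from difflib import SequenceMatcher

def membership_score(model_output: str, ground_truth: str) -> float:
    """Similarity in [0, 1] between generated code and the reference."""
    return SequenceMatcher(None, model_output, ground_truth).ratio()

def infer_member(model_output: str, ground_truth: str,
                 threshold: float = 0.9) -> bool:
    """Flag the sample as a suspected training-set member above the threshold."""
    return membership_score(model_output, ground_truth) >= threshold
```

A verbatim reproduction scores 1.0 and is flagged as a suspected member, while an unrelated completion falls far below the threshold; real attacks must of course cope with the gray zone in between.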
$A^{3}$-CodGen: A Repository-Level Code Generation Framework for Code Reuse With Local-Aware, Global-Aware, and Third-Party-Library-Aware
IF 6.5 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING Pub Date : 2024-10-24 DOI: 10.1109/TSE.2024.3486195
Dianshu Liao;Shidong Pan;Xiaoyu Sun;Xiaoxue Ren;Qing Huang;Zhenchang Xing;Huan Jin;Qinying Li
LLM-based code generation tools are essential for helping developers in the software development process. Existing tools often disconnect from the working context, i.e., the code repository, causing the generated code to differ from what a human developer would write. In this paper, we propose a novel code generation framework, dubbed $A^{3}$-CodGen, to harness information within the code repository to generate code with fewer potential logical errors, less code redundancy, and fewer library-induced compatibility issues. We identify three types of representative information for the code repository: local-aware information from the current code file, global-aware information from other code files, and third-party-library information. Results demonstrate that by adopting the $A^{3}$-CodGen framework, we successfully extract, fuse, and feed code repository information into the LLM, generating more accurate, efficient, and highly reusable code. The effectiveness of our framework is further underscored by the generation of code with a higher reuse rate than that of human developers. This research contributes significantly to the field of code generation, providing developers with a more powerful tool to address the evolving demands of software development in practice.
{"title":"$A^{3}$-CodGen: A Repository-Level Code Generation Framework for Code Reuse With Local-Aware, Global-Aware, and Third-Party-Library-Aware","authors":"Dianshu Liao;Shidong Pan;Xiaoyu Sun;Xiaoxue Ren;Qing Huang;Zhenchang Xing;Huan Jin;Qinying Li","doi":"10.1109/TSE.2024.3486195","DOIUrl":"10.1109/TSE.2024.3486195","url":null,"abstract":"LLM-based code generation tools are essential to help developers in the software development process. Existing tools often disconnect from the working context, i.e., the code repository, causing the generated code to differ from what a human developer would write. In this paper, we propose a novel code generation framework, dubbed \u0000<inline-formula><tex-math>$A^{3}$</tex-math></inline-formula>\u0000-CodGen, to harness information within the code repository to generate code with fewer potential logical errors, code redundancy, and library-induced compatibility issues. We identify three types of representative information for the code repository: local-aware information from the current code file, global-aware information from other code files, and third-party-library information. Results demonstrate that by adopting the \u0000<inline-formula><tex-math>$A^{3}$</tex-math></inline-formula>\u0000-CodGen framework, we successfully extract, fuse, and feed code repository information into the LLM, generating more accurate, efficient, and highly reusable code. The effectiveness of our framework is further underscored by generating code with a higher reuse rate, compared to human developers. 
This research contributes significantly to the field of code generation, providing developers with a more powerful tool to address the evolving demands in software development in practice.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"50 12","pages":"3369-3384"},"PeriodicalIF":6.5,"publicationDate":"2024-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142489538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
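The fusion of the three information sources into a single LLM prompt might look like the following sketch. The section markers, the per-section truncation budget, and the function name are assumptions for illustration; the paper's actual extraction and fusion pipeline is more involved than simple concatenation.

```python
# Hypothetical sketch of fusing local-aware, global-aware, and third-party-
# library context into one prompt. Markers and budget are assumptions, not
# the A3-CodGen implementation.

def build_prompt(task: str, local_ctx: str, global_ctx: str, lib_ctx: str,
                 budget: int = 4000) -> str:
    """Concatenate the three context sources, then the task description."""
    sections = [
        ("# Local context (current file)", local_ctx),
        ("# Global context (related files)", global_ctx),
        ("# Third-party libraries in use", lib_ctx),
        ("# Task", task),
    ]
    per_section = budget // len(sections)  # crude equal-share truncation
    parts = []
    for header, body in sections:
        parts.append(header)
        parts.append(body[:per_section])
    return "\n".join(parts)
```

The fixed ordering keeps the task last, so the most immediately relevant instruction sits closest to where the model begins generating; a production pipeline would weight and rank the retrieved context instead of truncating each section equally.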
Don’t Confuse! Redrawing GUI Navigation Flow in Mobile Apps for Visually Impaired Users
IF 6.5 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING Pub Date : 2024-10-23 DOI: 10.1109/TSE.2024.3485225
Mengxi Zhang;Huaxiao Liu;Yuheng Zhou;Chunyang Chen;Pei Huang;Jian Zhao
Mobile applications (apps) are integral to our daily lives, offering diverse services and functionalities. They enable sighted users to access information coherently and in an extremely convenient manner. However, it remains unclear whether visually impaired users, who rely solely on screen readers (e.g., TalkBack) to navigate and access app information, can do so in a correct and reasonable order. This may result in significant information bias and operational errors. Furthermore, in our preliminary exploration, we found that the navigation sequence-related issues encountered by visually impaired users fall into two types: unintuitive navigation sequences and unapparent focus switching. To address these issues, we propose a method named RGNF (Re-draw GUI Navigation Flow). It aims to enhance the understandability and coherence of accessing the content of each component within the Graphical User Interface (GUI), and to assist developers in creating a well-designed GUI navigation flow (GNF). The method is motivated by a characteristic identified in our preliminary study: visually impaired users expect consecutively read GUI components to be close in position and similar in shape. Thus, our method relies on principles derived from the Gestalt psychological model, grouping GUI components into regions according to the laws of proximity and similarity and thereby redrawing the GNFs. To evaluate the effectiveness of our method, we calculated sequence similarity values before and after redrawing the GNF, and further employed the tools proposed by Alotaibi et al. to measure the reachability of GUI components. Our results demonstrate a substantial improvement in similarity (0.921 versus the baseline of 0.624) and in reachability (90.31% versus the baseline GNF's 74.35%). Furthermore, a qualitative user study revealed that our method has a positive effect on the user experience of visually impaired users.
{"title":"Don’t Confuse! Redrawing GUI Navigation Flow in Mobile Apps for Visually Impaired Users","authors":"Mengxi Zhang;Huaxiao Liu;Yuheng Zhou;Chunyang Chen;Pei Huang;Jian Zhao","doi":"10.1109/TSE.2024.3485225","DOIUrl":"10.1109/TSE.2024.3485225","url":null,"abstract":"Mobile applications (apps) are integral to our daily lives, offering diverse services and functionalities. They enable sighted users to access information coherently in an extremely convenient manner. However, it remains unclear if visually impaired users, who rely solely on the screen readers (e.g., Talkback) to navigate and access app information, can do so in the correct and reasonable order. This may result in significant information bias and operational errors. Furthermore, in our preliminary exploration, we explained and clarified that the navigation sequence-related issues encountered by visually impaired users could be categorized into two types: unintuitive navigation sequence and unapparent focus switching. Considering these issues, in this work, we proposed a method named RGNF (Re-draw GUI Navigation Flow). It aimed to enhance the understandability and coherence of accessing the content of each component within the Graphical User Interface (GUI), together with assisting developers in creating well-designed GUI navigation flow (GNF). This method was inspired by the characteristics identified in our preliminary study, where visually impaired users expected navigation to be associated with close position and similar shape of GUI components that were read consecutively. Thus, our method relied on the principles derived from the Gestalt psychological model, aiming to group GUI components into different regions according to the laws of proximity and similarity, thereby redrawing the GNFs. To evaluate the effectiveness of our method, we calculated sequence similarity values before and after redrawing the GNF, and further employed the tools proposed by Alotaibi et al. 
to measure the reachability of GUI components. Our results demonstrated a substantial improvement in similarity (0.921) compared to the baseline (0.624), together with the reachability (90.31%) compared to the baseline GNF (74.35%). Furthermore, a qualitative user study revealed that our method had a positive effect on providing visually impaired users with an improved user experience.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"50 12","pages":"3351-3368"},"PeriodicalIF":6.5,"publicationDate":"2024-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142488445","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
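The Gestalt-style grouping that the RGNF abstract describes can be approximated with a toy proximity clustering: components whose centers lie within a distance threshold of any member join that region, and the navigation order then reads regions top-to-bottom with items left-to-right inside each region. The threshold, the single-pass grouping, and the omission of the similarity (shape) law are simplifying assumptions — this is not RGNF's full algorithm.

```python
# Toy approximation of proximity-based region grouping and reading order for
# screen-reader focus. Threshold and single-pass grouping are simplifications;
# RGNF also applies the Gestalt law of similarity, which this sketch omits.

from math import dist

def group_by_proximity(components, threshold=50.0):
    """components: list of (name, (x, y)) centers; returns a list of groups."""
    groups = []
    for name, center in components:
        for group in groups:
            if any(dist(center, other) <= threshold for _, other in group):
                group.append((name, center))
                break
        else:  # no nearby group found: start a new region
            groups.append([(name, center)])
    return groups

def navigation_order(components, threshold=50.0):
    """Flatten groups into a reading order for the screen-reader focus."""
    groups = group_by_proximity(components, threshold)
    groups.sort(key=lambda g: min(c[1] for _, c in g))  # topmost region first
    order = []
    for g in groups:
        # Within a region, read top-to-bottom, then left-to-right.
        order.extend(n for n, _ in sorted(g, key=lambda it: (it[1][1], it[1][0])))
    return order
```

Two buttons 10 px apart thus land in one region and are read consecutively, while a component 200 px away starts a new region — the behavior visually impaired users in the study expected from consecutive focus moves.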