Multitask-Based Evaluation of Open-Source LLM on Software Vulnerability
Xin Yin; Chao Ni; Shaohua Wang
Pub Date: 2024-10-07 | DOI: 10.1109/TSE.2024.3470333 | IEEE Transactions on Software Engineering, vol. 50, no. 11, pp. 3071-3087
This paper proposes a pipeline for quantitatively evaluating interactive Large Language Models (LLMs) using publicly available datasets. We carry out an extensive technical evaluation of LLMs on the Big-Vul dataset, covering four common software vulnerability tasks, and use it to assess the multi-tasking capabilities of LLMs. We find that existing state-of-the-art approaches and pre-trained Language Models (LMs) are generally superior to LLMs in software vulnerability detection. However, in software vulnerability assessment and location, certain LLMs (e.g., CodeLlama and WizardCoder) demonstrate superior performance to pre-trained LMs, and providing more contextual information can enhance the vulnerability assessment capabilities of LLMs. Moreover, LLMs exhibit strong vulnerability description capabilities, but their tendency to produce excessive output significantly weakens their performance compared to pre-trained LMs. Overall, although LLMs perform well in some respects, they still need to improve their understanding of the subtle differences between code vulnerabilities and their ability to describe vulnerabilities in order to fully realize their potential. Our evaluation pipeline provides valuable insights into the capabilities of LLMs in handling software vulnerabilities.
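A minimal sketch of what one step of such an evaluation could look like for the detection task is shown below: a yes/no prompt is built from a code snippet, the model's free-form answer is mapped to a binary label, and predictions are scored with F1. The prompt wording, the `query_llm` stub, and the sample records are illustrative assumptions, not the paper's actual prompts, models, or data; analogous prompts and task-specific metrics would be needed for the assessment, location, and description tasks.

```python
# Illustrative sketch of one vulnerability-detection evaluation step.
# The prompt template, query_llm stub, and sample records are assumptions
# for illustration only; they are not the paper's prompts or the Big-Vul data.
from sklearn.metrics import f1_score

PROMPT = (
    "You are a security expert. Answer only YES or NO.\n"
    "Is the following C function vulnerable?\n\n{code}\n"
)

def query_llm(prompt: str) -> str:
    # Placeholder for a call to an open-source LLM such as CodeLlama.
    return "NO"

def parse_label(response: str) -> int:
    # Map free-form model output to a binary label (1 = vulnerable).
    return 1 if "YES" in response.upper() else 0

# Tiny hypothetical sample standing in for Big-Vul records.
samples = [
    {"code": "int f(char *s){char b[8];strcpy(b,s);return 0;}", "label": 1},
    {"code": "int g(int a,int b){return a+b;}", "label": 0},
]

preds = [parse_label(query_llm(PROMPT.format(code=s["code"]))) for s in samples]
print("F1:", f1_score([s["label"] for s in samples], preds, zero_division=0))
```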
{"title":"Multitask-Based Evaluation of Open-Source LLM on Software Vulnerability","authors":"Xin Yin;Chao Ni;Shaohua Wang","doi":"10.1109/TSE.2024.3470333","DOIUrl":"10.1109/TSE.2024.3470333","url":null,"abstract":"This paper proposes a pipeline for quantitatively evaluating interactive Large Language Models (LLMs) using publicly available datasets. We carry out an extensive technical evaluation of LLMs using Big-Vul covering four different common software vulnerability tasks. This evaluation assesses the multi-tasking capabilities of LLMs based on this dataset. We find that the existing state-of-the-art approaches and pre-trained Language Models (LMs) are generally superior to LLMs in software vulnerability detection. However, in software vulnerability assessment and location, certain LLMs (e.g., CodeLlama and WizardCoder) have demonstrated superior performance compared to pre-trained LMs, and providing more contextual information can enhance the vulnerability assessment capabilities of LLMs. Moreover, LLMs exhibit strong vulnerability description capabilities, but their tendency to produce excessive output significantly weakens their performance compared to pre-trained LMs. Overall, though LLMs perform well in some aspects, they still need improvement in understanding the subtle differences in code vulnerabilities and the ability to describe vulnerabilities to fully realize their potential. Our evaluation pipeline provides valuable insights into the capabilities of LLMs in handling software vulnerabilities.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"50 11","pages":"3071-3087"},"PeriodicalIF":6.5,"publicationDate":"2024-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142384407","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Qualitative Surveys in Software Engineering Research: Definition, Critical Review, and Guidelines
Jorge Melegati; Kieran Conboy; Daniel Graziotin
Pub Date: 2024-10-04 | DOI: 10.1109/TSE.2024.3474173 | IEEE Transactions on Software Engineering, vol. 50, no. 12, pp. 3172-3187 | PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10705351
Qualitative surveys are emerging as a popular research method in software engineering (SE), particularly as many aspects of the field are increasingly socio-technical and thus concerned with the subtle, social, and often ambiguous issues that are not amenable to a simple quantitative survey. While many argue that qualitative surveys play a vital role amongst the diverse range of methods employed in SE, a number of shortcomings inhibit their use and value. First, there is a lack of clarity as to what defines a qualitative survey and what features differentiate it from other methods. There is an absence of a clear set of principles and guidelines for its execution, and what does exist is very inconsistent and sometimes contradictory. These issues undermine the perceived reliability and rigour of the method. Researchers are unsure how to ensure reliability and rigour when designing qualitative surveys, and reviewers are unsure how such studies should be evaluated. In this paper, we present a systematic mapping study to identify how qualitative surveys have been employed in SE research to date. We then propose a set of principles, based on a multidisciplinary review of qualitative surveys, that captures some of the commonalities of the diffuse approaches found. Researchers can use these principles to decide whether to conduct a qualitative survey and, if so, to design their study; editors and reviewers can use them to judge the quality and rigour of qualitative surveys. We hope this will lead to more widespread use of the method and to more effective, evidence-based reviews of studies that employ it in the future.
{"title":"Qualitative Surveys in Software Engineering Research: Definition, Critical Review, and Guidelines","authors":"Jorge Melegati;Kieran Conboy;Daniel Graziotin","doi":"10.1109/TSE.2024.3474173","DOIUrl":"10.1109/TSE.2024.3474173","url":null,"abstract":"Qualitative surveys are emerging as a popular research method in software engineering (SE), particularly as many aspects of the field are increasingly socio-technical and thus concerned with the subtle, social, and often ambiguous issues that are not amenable to a simple quantitative survey. While many argue that qualitative surveys play a vital role amongst the diverse range of methods employed in SE there are a number of shortcomings that inhibits its use and value. First there is a lack of clarity as to what defines a qualitative survey and what features differentiate it from other methods. There is an absence of a clear set of principles and guidelines for its execution, and what does exist is very inconsistent and sometimes contradictory. These issues undermine the perceived reliability and rigour of this method. Researchers are unsure about how to ensure reliability and rigour when designing qualitative surveys and reviewers are unsure how these should be evaluated. In this paper, we present a systematic mapping study to identify how qualitative surveys have been employed in SE research to date. This paper proposes a set of principles, based on a multidisciplinary review of qualitative surveys and capturing some of the commonalities of the diffuse approaches found. These principles can be used by researchers when choosing whether to do a qualitative survey or not. They can then be used to design their study. The principles can also be used by editors and reviewers to judge the quality and rigour of qualitative surveys. It is hoped that this will result in more widespread use of the method and also more effective and evidence-based reviews of studies that use these methods in the future.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"50 12","pages":"3172-3187"},"PeriodicalIF":6.5,"publicationDate":"2024-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10705351","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142377310","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FlakyFix: Using Large Language Models for Predicting Flaky Test Fix Categories and Test Code Repair
Sakina Fatima; Hadi Hemmati; Lionel C. Briand
Pub Date: 2024-10-02 | DOI: 10.1109/TSE.2024.3472476 | IEEE Transactions on Software Engineering, vol. 50, no. 12, pp. 3146-3171 | PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10704582
Flaky tests are problematic because they non-deterministically pass or fail for the same software version under test, causing confusion and wasting development effort. While machine learning models have been used to predict flakiness and its root causes, there is much less work on providing support to fix the problem. To address this gap, in this paper, we focus on predicting the type of fix that is required to remove flakiness and then repairing the test code on that basis. We do this for the subset of flaky tests whose root cause lies in the test itself rather than in the production code. One key idea is to guide the repair process with additional knowledge about the test's flakiness in the form of its predicted fix category. Thus, we first propose a framework that automatically generates labeled datasets for 13 fix categories and trains models to predict the fix category of a flaky test by analyzing the test code only. Our experimental results using code models and few-shot learning show that we can correctly predict most of the fix categories. To show the usefulness of such fix category labels for automatically repairing flakiness, we augment the prompts of GPT-3.5 Turbo, a Large Language Model (LLM), with this extra knowledge to request repair suggestions. The results show that our suggested fix category labels, complemented with in-context learning, significantly enhance the capability of GPT-3.5 Turbo in generating fixes for flaky tests. Based on the execution and analysis of a sample of GPT-repaired flaky tests, we estimate that a large percentage of such repairs (roughly between 51% and 83%) can be expected to pass. For the failing repaired tests, on average, 16% of the test code needs to be further changed for them to pass.
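To illustrate the prompt-augmentation idea, the sketch below prepends a predicted fix-category label to a repair request sent to GPT-3.5 Turbo through the OpenAI chat API. The category wording, the prompt text, and the flaky test are hypothetical placeholders, not the authors' FlakyFix implementation.

```python
# Sketch of augmenting a repair prompt with a predicted fix-category label.
# The category label, prompt wording, and flaky test below are hypothetical;
# this is not the FlakyFix pipeline itself. Requires OPENAI_API_KEY to be set.
from openai import OpenAI

client = OpenAI()

flaky_test = """
def test_fetch_status():
    resp = start_async_fetch("https://example.org")
    assert resp.status == 200   # may run before the fetch completes
"""

predicted_fix_category = "add or adjust an explicit wait for an async call"

messages = [
    {"role": "system", "content": "You repair flaky unit tests."},
    {
        "role": "user",
        "content": (
            f"The following test is flaky. A classifier predicts the fix "
            f"category: '{predicted_fix_category}'.\n"
            f"Rewrite the test so it passes deterministically.\n{flaky_test}"
        ),
    },
]

response = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
print(response.choices[0].message.content)
```

Placing the category label directly in the user message mirrors the in-context-learning setup described above, where the predicted fix category serves as extra knowledge that narrows the space of plausible repairs.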
{"title":"FlakyFix: Using Large Language Models for Predicting Flaky Test Fix Categories and Test Code Repair","authors":"Sakina Fatima;Hadi Hemmati;Lionel C. Briand","doi":"10.1109/TSE.2024.3472476","DOIUrl":"10.1109/TSE.2024.3472476","url":null,"abstract":"Flaky tests are problematic because they non-deterministically pass or fail for the same software version under test, causing confusion and wasting development effort. While machine learning models have been used to predict flakiness and its root causes, there is much less work on providing support to fix the problem. To address this gap, in this paper, we focus on predicting the type of fix that is required to remove flakiness and then repair the test code on that basis. We do this for a subset of flaky tests where the root cause of flakiness is in the test itself and not in the production code. One key idea is to guide the repair process with additional knowledge about the test's flakiness in the form of its predicted fix category. Thus, we first propose a framework that automatically generates labeled datasets for 13 fix categories and trains models to predict the fix category of a flaky test by analyzing the test code only. Our experimental results using code models and few-shot learning show that we can correctly predict most of the fix categories. To show the usefulness of such fix category labels for automatically repairing flakiness, we augment the prompts of GPT 3.5 Turbo, a Large Language Model (LLM), with such extra knowledge to request repair suggestions. The results show that our suggested fix category labels, complemented with in-context learning, significantly enhance the capability of GPT 3.5 Turbo in generating fixes for flaky tests. Based on the execution and analysis of a sample of GPT-repaired flaky tests, we estimate that a large percentage of such repairs, (roughly between 51% and 83%) can be expected to pass. For the failing repaired tests, on average, 16% of the test code needs to be further changed for them to pass.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"50 12","pages":"3146-3171"},"PeriodicalIF":6.5,"publicationDate":"2024-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10704582","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142368849","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-30 | DOI: 10.1109/TSE.2024.3469582
Rongqi Pan; Taher A. Ghaleb; Lionel C. Briand
Test suites tend to grow when software evolves, often making it infeasible to execute all test cases within the allocated testing budget, especially for large software systems. Test suite minimization (TSM) is employed to improve the efficiency of software testing by removing redundant test cases, thus reducing testing time and resources while maintaining the fault detection capability of the test suite. Most existing TSM approaches rely on code coverage (white-box) or model-based features, which are not always available to test engineers. Recent TSM approaches that rely only on test code (black-box) have been proposed, such as ATM and FAST-R. The former yields higher fault detection rates (FDR