
Latest Publications in IEEE Transactions on Software Engineering

Clopper-Pearson Algorithms for Efficient Statistical Model Checking Estimation
IF 6.5, CAS Tier 1 (Computer Science), Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING. Pub Date: 2024-04-23. DOI: 10.1109/TSE.2024.3392720
Hao Bu;Meng Sun
Statistical model checking (SMC) is a simulation-based formal verification technique that addresses the scalability problem faced by traditional model checking. The main workflow of SMC is to perform iterative simulations. The number of simulations depends on users' requirements for the verification results, and can be very large if users require a high level of confidence and precision. Therefore, how to perform as few simulations as possible while achieving the same level of confidence and precision is one of the core problems of SMC. In this paper, we consider the estimation problem of SMC. Most existing statistical model checkers use the Okamoto bound to decide the simulation number. Although the Okamoto bound is sound, it is well known to be overly conservative. The simulation number decided by the Okamoto bound is usually much higher than actually needed, which leads to a waste of time and computation resources. To tackle this problem, we propose an efficient, sound, and lightweight estimation algorithm using the Clopper-Pearson confidence interval. We perform comprehensive numerical experiments and case studies to evaluate the performance of our algorithm, and the results show that our algorithm uses 40%-60% fewer simulations than the Okamoto bound. Our algorithm can be directly integrated into existing model checkers to reduce the verification time of SMC estimation problems.
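As background for the comparison in this abstract, the two bounds can be sketched in a few lines of standard-library Python: the Okamoto (Chernoff-Hoeffding) bound fixes the simulation count up front from the precision ε and confidence δ alone, while a Clopper-Pearson interval is computed from the observed successes, so it can justify stopping earlier. This is a hedged illustration of the underlying statistics, not the authors' algorithm; the function names and the binary-search inversion are ours.

```python
import math

def binom_cdf(k, n, p):
    """P[X <= k] for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def okamoto_bound(eps, delta):
    """Okamoto (Chernoff-Hoeffding) sample size: enough simulations to
    guarantee P(|estimate - p| >= eps) <= delta, regardless of p."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

def clopper_pearson(k, n, alpha=0.05):
    """Exact (1 - alpha) Clopper-Pearson interval for k successes out of
    n trials, found by inverting the binomial tails with binary search
    (binom_cdf is decreasing in p, so the bisection below is monotone)."""
    def invert(target, k_):
        lo, hi = 0.0, 1.0
        for _ in range(60):  # bracket shrinks to ~2^-60, ample for doubles
            mid = (lo + hi) / 2
            if binom_cdf(k_, n, mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2
    lower = 0.0 if k == 0 else invert(1 - alpha / 2, k - 1)
    upper = 1.0 if k == n else invert(alpha / 2, k)
    return lower, upper

# For precision 0.01 and confidence 0.95, the Okamoto bound fixes the
# simulation count in advance, no matter how the simulations turn out.
n_okamoto = okamoto_bound(0.01, 0.05)  # 18445

# A Clopper-Pearson interval instead narrows as observations accrue,
# which is what lets an estimation loop stop as soon as the interval
# is tight enough.
interval = clopper_pearson(5, 10)  # roughly (0.19, 0.81)
```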
Citations: 0
A Platform-Agnostic Framework for Automatically Identifying Performance Issue Reports With Heuristic Linguistic Patterns
IF 6.5, CAS Tier 1 (Computer Science), Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING. Pub Date: 2024-04-17. DOI: 10.1109/TSE.2024.3390623
Yutong Zhao;Lu Xiao;Sunny Wong
Software performance is critical for system efficiency, with performance issues potentially resulting in budget overruns, project delays, and market losses. Such problems are reported to developers through issue tracking systems, where they are often under-tagged, as the manual tagging process is voluntary and time-consuming. Existing automated performance issue tagging techniques, such as keyword matching and machine/deep learning models, struggle due to imbalanced datasets and a high degree of variance. This paper presents a novel hybrid classification approach, combining Heuristic Linguistic Patterns (HLPs) with machine/deep learning models to enable practitioners to automatically identify performance-related issues. The proposed approach works across three progressive levels: HLP tagging, sentence tagging, and issue tagging, with a focus on linguistic analysis of issue descriptions. The authors evaluate the approach on three datasets collected from different projects and issue-tracking platforms to show that the proposed framework is accurate, project- and platform-agnostic, and robust to imbalanced datasets. Furthermore, this study also examined how the framework's two unique techniques, fuzzy HLP matching and the Issue HLP Matrix, contribute to the accuracy. Finally, the study explored the effectiveness and impact of two off-the-shelf feature selection techniques, Boruta and RFE, within the proposed framework. The results showed that the proposed framework has great potential for practitioners to accurately (with up to 100% precision, 66% recall, and 79% F1-score) identify performance issues, with robustness to imbalanced data and good transferability to new projects and issue tracking platforms.
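The three progressive tagging levels described above can be illustrated with a toy sketch. The patterns below are hypothetical stand-ins, not the paper's actual HLP catalogue, and the matching is plain regex rather than the framework's fuzzy HLP matching.

```python
import re

# Hypothetical heuristic linguistic patterns for flagging
# performance-related wording in issue descriptions.
HLPS = {
    "slowness": re.compile(r"\b(slow|sluggish|lag(?:gy|s|ging)?)\b", re.I),
    "duration": re.compile(
        r"\btakes?\s+\d+(?:\.\d+)?\s*(?:ms|seconds?|s|minutes?|hours?)\b", re.I),
    "resource": re.compile(r"\b(high|excessive)\s+(cpu|memory|disk|heap)\b", re.I),
}

def tag_sentence(sentence):
    """Level 1 (HLP tagging): return the set of HLP names a sentence
    matches. A non-empty set marks the sentence as performance-related
    (level 2: sentence tagging)."""
    return {name for name, pat in HLPS.items() if pat.search(sentence)}

issue = ["The export dialog takes 45 seconds to open.",
         "Clicking cancel closes the window."]
sentence_tags = [tag_sentence(s) for s in issue]
# Level 3 (issue tagging): flag the issue if any sentence carries an HLP.
is_performance_issue = any(sentence_tags)
```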
Citations: 0
MMO: Meta Multi-Objectivization for Software Configuration Tuning
IF 7.4, CAS Tier 1 (Computer Science), Q1 Computer Science. Pub Date: 2024-04-15. DOI: 10.1109/TSE.2024.3388910
Pengzhou Chen;Tao Chen;Miqing Li
Software configuration tuning is essential for optimizing a given performance objective (e.g., minimizing latency). Yet, due to the software's intrinsically complex configuration landscape and expensive measurement, success has been rather limited, particularly in preventing the search from being trapped in local optima. To address this issue, in this paper we take a different perspective. Instead of focusing on improving the optimizer, we work at the level of the optimization model and propose a meta multi-objectivization (MMO) model that considers an auxiliary performance objective (e.g., throughput in addition to latency). What makes this model distinct is that we do not optimize the auxiliary performance objective, but rather use it to make similarly performing yet different configurations less comparable (i.e., Pareto nondominated to each other), thus preventing the search from being trapped in local optima. Importantly, by designing a new normalization method, we show how to effectively use the MMO model without worrying about its weight: the only, yet highly sensitive, parameter that can affect its effectiveness. Experiments on 22 cases from 11 real-world software systems/environments confirm that our MMO model with the new normalization performs better than its state-of-the-art single-objective counterparts on 82% of cases while achieving up to 2.09x speedup. For 68% of the cases, the new normalization also enables the MMO model to outperform its counterpart that uses the normalization from our prior FSE work under pre-tuned best weights, saving the substantial resources that would otherwise be needed to find a good weight. We also demonstrate that the MMO model with the new normalization can consolidate recent model-based tuning tools on 68% of the cases, with up to 1.22x speedup in general.
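The core mechanism, keeping an auxiliary objective only to break comparability rather than to optimize it, can be shown with a plain Pareto-dominance check. This is an illustrative sketch under assumed toy values, not the authors' implementation; the configurations and the sign convention (throughput negated so both objectives are minimized) are ours.

```python
def dominates(a, b):
    """True iff a Pareto-dominates b under minimization: a is no worse
    in every objective and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

# Single-objective view: only latency counts, so cfg_a strictly beats
# cfg_b and the search can discard cfg_b, risking local optima.
cfg_a, cfg_b = (10.0,), (10.5,)
better_single = dominates(cfg_a, cfg_b)  # True

# MMO-style view: add an auxiliary objective (throughput, negated so
# both are minimized) purely as a tie-breaker for comparability. The
# two similarly performing configurations become mutually nondominated,
# so both survive and search diversity is preserved.
cfg_a, cfg_b = (10.0, -200.0), (10.5, -250.0)
incomparable = not dominates(cfg_a, cfg_b) and not dominates(cfg_b, cfg_a)  # True
```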
Citations: 0
Pretrain, Prompt, and Transfer: Evolving Digital Twins for Time-to-Event Analysis in Cyber-Physical Systems
IF 7.4, CAS Tier 1 (Computer Science), Q1 Computer Science. Pub Date: 2024-04-15. DOI: 10.1109/TSE.2024.3388572
Qinghua Xu;Tao Yue;Shaukat Ali;Maite Arratibel
Cyber-physical systems (CPSs), e.g., elevators and autonomous driving systems, are progressively permeating our everyday lives. To ensure their safety, various analyses need to be conducted, such as anomaly detection and time-to-event analysis (the focus of this paper). Recently, it has been widely accepted that digital twins (DTs) can be an efficient means of aiding the development, maintenance, and safe and secure operation of CPSs. However, CPSs frequently evolve, e.g., with new or updated functionalities, which demands that their corresponding DTs co-evolve, i.e., stay in synchronization with the CPSs. To that end, we propose a novel method, named PPT, utilizing uncertainty-aware transfer learning for DT evolution. Specifically, we first pretrain PPT with a pretraining dataset to acquire generic knowledge about the CPSs, then adapt it to a specific CPS with the help of prompt tuning. Results highlight that PPT is effective in time-to-event analysis in both the elevator and autonomous driving case studies, on average outperforming a baseline method by 7.31 and 12.58 in terms of Huber loss, respectively. The experiment results also affirm the effectiveness of transfer learning, prompt tuning, and uncertainty quantification, which reduce Huber loss by at least 21.32, 3.14, and 4.08, respectively, in both case studies.
Citations: 0
Generic Sensitivity: Generics-Guided Context Sensitivity for Pointer Analysis
IF 7.4, CAS Tier 1 (Computer Science), Q1 Computer Science. Pub Date: 2024-04-12. DOI: 10.1109/TSE.2024.3377645
Haofeng Li;Tian Tan;Yue Li;Jie Lu;Haining Meng;Liqing Cao;Yongheng Huang;Lian Li;Lin Gao;Peng Di;Liang Lin;ChenXi Cui
Generic programming has found widespread application in object-oriented languages like Java. However, existing context-sensitive pointer analyses fail to leverage the benefits of generic programming. This paper introduces generic sensitivity, a new context customization scheme targeting generics. We design our context customization scheme in such a way that generic instantiation sites, i.e., locations instantiating generic classes/methods with concrete types, are always preserved as key context elements. This is realized by augmenting contexts with a type variable lookup map, which is efficiently generated in a context-sensitive manner throughout the analysis process. We have implemented various variants of generic-sensitive analysis in WALA and conducted extensive experiments to compare it with state-of-the-art approaches, including both traditional and selective context-sensitivity methods. The evaluation results demonstrate that generic sensitivity effectively enhances existing context-sensitivity approaches, striking a new balance between efficiency and precision. For instance, it enables a 1-object-sensitive analysis to achieve overall better precision compared to a 2-object-sensitive analysis, with an average speedup of 12.6 times (up to 62 times).
Citations: 0
Characterizing Timeout Builds in Continuous Integration
IF 7.4, CAS Tier 1 (Computer Science), Q1 Computer Science. Pub Date: 2024-04-11. DOI: 10.1109/TSE.2024.3387840
Nimmi Weeraddana;Mahmoud Alfadel;Shane McIntosh
Compute resources that enable Continuous Integration (CI, i.e., the automatic build and test cycle applied to the change sets that development teams produce) are a shared commodity that organizations need to manage. To prevent (erroneous) builds from consuming a large amount of resources, CI service providers often impose a time limit. CI builds that exceed the time limit are automatically terminated. While imposing a time limit helps to prevent abuse of the service, builds that time out (a) consume the maximum amount of resources that a CI service is willing to provide and (b) leave CI users without an indication of whether the change set will pass or fail the CI process. Therefore, understanding timeout builds and the factors that contribute to them is important for improving the stability and quality of a CI service. In this paper, we investigate the prevalence of timeout builds and the characteristics associated with them. By analyzing a curated dataset of 936 projects that adopt the CircleCI service and report at least one timeout build, we find that the median duration of a timeout build (19.7 minutes) is more than five times that of a build that produces a pass or fail result (3.4 minutes). To better understand the factors contributing to timeout builds, we model them using characteristics of project build history, build queued time, timeout tendency, size, and author experience, based on data collected from 105,663 CI builds. Our model demonstrates a discriminatory power that vastly surpasses that of a random predictor (Area Under the Receiver Operating characteristic Curve, i.e., AUROC = 0.939) and is highly stable in its performance (AUROC optimism = 0.0001). Moreover, our model reveals that the build history and timeout tendency features are strong indicators of timeout builds, with the timeout status of the most recent build accounting for the largest proportion of the explanatory power. A longitudinal analysis of the incidence of timeout builds (i.e., a study conducted over a period of time) indicates that 64.03% of timeout builds occur consecutively. In such cases, a median of 24 hours passes before a build that passes or fails occurs. Our results imply that CI providers should exploit build history to anticipate timeout builds.
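The AUROC figure reported in this abstract measures ranking quality. As a generic reference point (unrelated to the authors' tooling), the metric reduces to the Mann-Whitney statistic and can be computed directly from labels and predicted scores:

```python
def auroc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U)
    formulation: the probability that a randomly chosen positive is
    scored above a randomly chosen negative, counting ties as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfect ranker (every timeout build scored above every non-timeout build) yields 1.0; a constant score yields the random-predictor value of 0.5.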
Citations: 0
LIVABLE: Exploring Long-Tailed Classification of Software Vulnerability Types
IF 7.4, CAS Tier 1 (Computer Science), Q1 Computer Science. Pub Date: 2024-04-11. DOI: 10.1109/TSE.2024.3382361
Xin-Cheng Wen;Cuiyun Gao;Feng Luo;Haoyu Wang;Ge Li;Qing Liao
Prior studies generally focus on software vulnerability detection and have demonstrated the effectiveness of Graph Neural Network (GNN)-based approaches for the task. Given the variety of software vulnerability types and their differing degrees of severity, it is also beneficial for developers to determine the type of each vulnerable piece of code. In this paper, we observe that the distribution of vulnerability types is long-tailed in practice: a small portion of classes have massive samples (i.e., head classes) while the others contain only a few (i.e., tail classes). Directly adopting previous vulnerability detection approaches tends to yield poor performance, for two main reasons. First, the over-smoothing issue of GNNs makes it difficult to learn effective vulnerability representations. Second, tail vulnerability types are hard to predict because of their extremely few associated samples. To alleviate these issues, we propose a Long-taIled software VulnerABiLity typE classification approach, called LIVABLE. LIVABLE consists of two modules: (1) a vulnerability representation learning module, which improves the propagation steps in the GNN to distinguish node representations via a differentiated propagation method, and additionally employs a sequence-to-sequence model to enhance the vulnerability representations; and (2) an adaptive re-weighting module, which adjusts the learning weights for different types according to the training epoch and the number of associated samples via a novel training loss. We verify the effectiveness of LIVABLE on both type classification and vulnerability detection tasks. For vulnerability type classification, experiments on the Fan et al. dataset show that LIVABLE outperforms the state-of-the-art methods by 24.18% in accuracy and improves performance on tail classes by 7.7%. To evaluate the efficacy of the vulnerability representation learning module, we further compare it with recent vulnerability detection approaches on three benchmark datasets; the proposed representation learning module improves the best baselines by 4.03% in accuracy on average.
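The adaptive re-weighting idea — gradually shifting loss weight toward tail classes as training progresses — can be sketched as follows. This is a hypothetical illustration only: the function name, the linear epoch schedule, and the inverse-frequency target are assumptions, not LIVABLE's actual training loss.

```python
def adaptive_class_weights(class_counts, epoch, total_epochs):
    """Blend from uniform class weights toward inverse-frequency
    weights as training progresses (hypothetical schedule; the
    paper's actual loss may differ)."""
    n_classes = len(class_counts)
    total = sum(class_counts)
    # Inverse-frequency weights, rescaled so they sum to n_classes.
    inv = [total / c for c in class_counts]
    scale = n_classes / sum(inv)
    inv = [w * scale for w in inv]
    # Interpolation factor grows linearly with the training epoch.
    alpha = min(1.0, epoch / total_epochs)
    return [(1 - alpha) + alpha * w for w in inv]

# Head class (900 samples) vs. tail class (100 samples), mid-training:
weights = adaptive_class_weights([900, 100], epoch=5, total_epochs=10)
# The tail class already receives a larger weight than the head class.
```

Early in training all classes are weighted near-uniformly; by the final epoch the tail class dominates the loss, mirroring the abstract's "adjusts the learning weights ... according to the training epochs and numbers of associated samples".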
Citations: 0
Controller Synthesis for Autonomous Systems With Deep-Learning Perception Components
IF 7.4 CAS Tier 1 Q1 Computer Science Pub Date : 2024-04-10 DOI: 10.1109/TSE.2024.3385378
Radu Calinescu;Calum Imrie;Ravi Mangal;Genaína Nunes Rodrigues;Corina Păsăreanu;Misael Alpizar Santana;Gricel Vázquez
We present DeepDECS, a new method for the synthesis of correct-by-construction software controllers for autonomous systems that use deep neural network (DNN) classifiers for the perception step of their decision-making processes. Despite major advances in deep learning in recent years, providing safety guarantees for these systems remains very challenging. Our controller synthesis method addresses this challenge by integrating DNN verification with the synthesis of verified Markov models. The synthesised models correspond to discrete-event software controllers guaranteed to satisfy the safety, dependability and performance requirements of the autonomous system, and to be Pareto optimal with respect to a set of optimisation objectives. We evaluate the method in simulation by using it to synthesise controllers for mobile-robot collision limitation, and for maintaining driver attentiveness in shared-control autonomous driving.
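To give a flavour of how a verified perception probability can feed a probabilistic safety model, here is a deliberately tiny sketch. The two-parameter model and all names are assumptions made for illustration; DeepDECS itself synthesises and verifies full Markov models rather than this closed-form toy.

```python
def collision_probability(p_detect, p_collide_given_miss, n_encounters):
    """Toy safety model: a verified lower bound on the DNN's obstacle
    detection probability induces a per-encounter collision risk,
    compounded over independent encounters."""
    per_encounter = (1.0 - p_detect) * p_collide_given_miss
    return 1.0 - (1.0 - per_encounter) ** n_encounters

# A detector verified to be at least 99% accurate, over ten
# independent obstacle encounters:
risk = collision_probability(p_detect=0.99, p_collide_given_miss=0.5,
                             n_encounters=10)
```

The point of the sketch is the direction of information flow: a guarantee obtained by DNN verification (`p_detect`) becomes a transition probability in the model against which safety requirements are checked.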
Citations: 0
Domain-Driven Design for Microservices: An Evidence-Based Investigation
IF 7.4 CAS Tier 1 Q1 Computer Science Pub Date : 2024-04-10 DOI: 10.1109/TSE.2024.3385835
Chenxing Zhong;Shanshan Li;Huang Huang;Xiaodong Liu;Zhikun Chen;Yi Zhang;He Zhang
MicroService Architecture (MSA), a predominant architectural style in recent years, still faces the arduous task of identifying the boundaries of microservices. Domain-Driven Design (DDD) is regarded as one of the major design methods for addressing this task in practice; it aims to iteratively build domain models using a series of patterns, principles, and practices. The adoption of DDD for MSA (DDD4M in short) can, however, present considerable challenges in terms of sufficiently understanding the methodological requirements and the application domains. It is imperative to establish a systematic understanding of the various aspects of employing DDD4M and to provide effective guidance. This study reports an empirical inquiry that integrates a systematic literature review and a confirmatory survey. By reviewing 34 scientific studies and consulting 63 practitioners, it reveals several distinctive findings on the state of, challenges in, and possible solutions for DDD4M applications, from the 5W1H perspectives: When, Where, Why, Who, What, and How. The analysis and synthesis of evidence show a wide variation in understanding of domain modeling artifacts. The status quo indicates the need for further methodological support in terms of application process, domain model design and implementation, and domain knowledge acquisition and management. To advance the state of the practice, our findings were organized into a preliminary checklist intended to assist practitioners by illuminating a DDD4M application process and the specific key considerations along the way.
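For readers unfamiliar with the tactical patterns the study's practitioners apply, here is a minimal sketch of a value object, an entity, and an aggregate root enforcing an invariant. It is purely illustrative; the domain, class names, and invariant are hypothetical and not taken from the study.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Money:
    """Value object: immutable and compared by value."""
    amount: int
    currency: str

@dataclass
class OrderLine:
    """Entity living inside the aggregate."""
    sku: str
    price: Money

class Order:
    """Aggregate root: the single entry point for changes, enforcing
    the invariant that all lines share the order's currency."""
    def __init__(self, currency):
        self.currency = currency
        self.lines = []

    def add_line(self, sku, price):
        if price.currency != self.currency:
            raise ValueError("currency mismatch inside aggregate")
        self.lines.append(OrderLine(sku, price))

    def total(self):
        return Money(sum(l.price.amount for l in self.lines), self.currency)

order = Order("EUR")
order.add_line("book", Money(10, "EUR"))
order.add_line("pen", Money(5, "EUR"))
```

In a DDD4M setting, an aggregate such as `Order` typically also marks a candidate microservice boundary: everything the invariant must see lives inside one service.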
Citations: 0
Test Input Prioritization for Graph Neural Networks
IF 7.4 CAS Tier 1 Q1 Computer Science Pub Date : 2024-04-05 DOI: 10.1109/TSE.2024.3385538
Yinghua Li;Xueqi Dang;Weiguo Pian;Andrew Habib;Jacques Klein;Tegawendé F. Bissyandé
GNNs have shown remarkable performance in a variety of classification tasks. The reliability of GNN models needs to be thoroughly validated before deployment to ensure their accurate functioning, so effective testing is essential for identifying vulnerabilities in GNN models. However, given the complexity and size of graph-structured data, the cost of manually labeling GNN test inputs can be prohibitively high for real-world use cases. Although several approaches have been proposed in the general domain of Deep Neural Network (DNN) testing to alleviate this labeling cost, they are not suitable for GNNs because they do not account for the interdependence between GNN test inputs, which is crucial for GNN inference. In this paper, we propose NodeRank, a novel test prioritization approach specifically for GNNs, guided by ensemble learning-based mutation analysis. Inspired by traditional mutation testing, where specific operators mutate code statements to check whether the provided test cases reveal faults, NodeRank rests on a crucial premise: if a test input (node) kills many mutated models and yields different prediction results on many mutated inputs, it is more likely to be misclassified by the GNN model and should receive a higher priority. Through prioritization, these potentially misclassified inputs can be identified earlier at limited manual labeling cost. NodeRank introduces mutation operators suitable for GNNs, targeting three key aspects: the graph structure, the features of the graph nodes, and the GNN model itself. NodeRank generates mutants and compares their predictions against those on the initial test inputs. Based on the comparison results, a mutation feature vector is generated for each test input and fed into ranking models for test prioritization. Leveraging ensemble learning techniques, NodeRank combines the predictions of the base ranking models into a misclassification score for each test input, indicating how likely that input is to be misclassified. NodeRank then sorts all test inputs by score in descending order. To evaluate NodeRank, we build 124 GNN subjects (i.e., pairs of a dataset and a GNN model), covering both natural and adversarial contexts. Our results demonstrate that NodeRank outperforms all compared test prioritization approaches in terms of both APFD and PFD, two widely adopted metrics in this field. Specifically, NodeRank achieves an average improvement of between 4.41% and 58.11% on original datasets and between 4.96% and 62.15% on adversarial datasets.
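A stripped-down version of the kill-based scoring idea might look like the following. This is a hypothetical simplification: NodeRank actually builds mutation feature vectors and feeds them into learned ranking models combined by ensemble learning, which we replace here with a plain kill ratio.

```python
def misclassification_scores(original_preds, mutant_preds):
    """Fraction of mutants that 'kill' each test input, i.e. predict
    a different label than the original model does."""
    n_mutants = len(mutant_preds)
    return [
        sum(1 for m in mutant_preds if m[i] != orig) / n_mutants
        for i, orig in enumerate(original_preds)
    ]

def prioritize(scores):
    """Indices of test inputs, most suspicious first."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])

# Three test inputs, three mutant models:
scores = misclassification_scores([0, 1, 1],
                                  [[0, 0, 1], [1, 0, 1], [0, 1, 0]])
order = prioritize(scores)  # input 1 disagrees with the most mutants
```

Inputs that many mutants disagree on are labeled first, which is exactly the property the APFD/PFD metrics in the evaluation reward.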
Citations: 0