Test Input Prioritization for Graph Neural Networks

IF 6.5 | Tier 1 (Computer Science) | Q1 (Computer Science, Software Engineering) | IEEE Transactions on Software Engineering | Pub Date: 2024-04-05 | DOI: 10.1109/TSE.2024.3385538

Yinghua Li; Xueqi Dang; Weiguo Pian; Andrew Habib; Jacques Klein; Tegawendé F. Bissyandé

Citations: 0

Abstract

Graph Neural Networks (GNNs) have shown remarkable performance in a variety of classification tasks. The reliability of GNN models needs to be thoroughly validated before deployment to ensure that they function accurately. Effective testing is therefore essential for identifying vulnerabilities in GNN models. However, given the complexity and size of graph-structured data, the cost of manually labeling GNN test inputs can be prohibitively high for real-world use cases. Although several approaches have been proposed in the general domain of Deep Neural Network (DNN) testing to alleviate this labeling cost, they are not suitable for GNNs because they do not account for the interdependence between GNN test inputs, which is crucial for GNN inference. In this paper, we propose NodeRank, a novel test prioritization approach specifically for GNNs, guided by ensemble learning-based mutation analysis. Inspired by traditional mutation testing, where specific operators are applied to mutate code statements to determine whether the provided test cases reveal faults, NodeRank operates on a crucial premise: if a test input (node) can kill many mutated models and produces different prediction results under many mutated inputs, it is considered more likely to be misclassified by the GNN model and should be prioritized higher. Through prioritization, these potentially misclassified inputs can be identified earlier with limited manual labeling cost. NodeRank introduces mutation operators suitable for GNNs, focusing on three key aspects: the graph structure, the features of the graph nodes, and the GNN model itself. NodeRank generates mutants and compares their predictions against those for the initial test inputs. Based on the comparison results, a mutation feature vector is generated for each test input and used as the input to ranking models for test prioritization. Leveraging ensemble learning techniques, NodeRank combines the prediction results of the base ranking models and produces a misclassification score for each test input, which indicates the likelihood of this input being misclassified. NodeRank sorts all the test inputs by score in descending order. To evaluate NodeRank, we build 124 GNN subjects (i.e., pairs of a dataset and a GNN model), incorporating both natural and adversarial contexts. Our results demonstrate that NodeRank outperforms all the compared test prioritization approaches in terms of both APFD and PFD, which are widely adopted metrics in this field. Specifically, NodeRank achieves an average improvement of between 4.41% and 58.11% on original datasets and between 4.96% and 62.15% on adversarial datasets.
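The pipeline described in the abstract (compare mutant predictions with the original ones, build a mutation feature vector per node, score, sort, evaluate) can be illustrated with a minimal Python sketch. This is not the authors' implementation. All function names are hypothetical, and the score used here is simply the fraction of mutants killed, a stand-in for NodeRank's learned ensemble of ranking models, which the abstract does not specify in detail. The APFD function follows the definition commonly used in test prioritization work, APFD = 1 - (sum of positions of misclassified inputs) / (k * n) + 1 / (2n), where n is the number of test inputs and k the number of misclassified ones; that the paper uses exactly this form is an assumption.

```python
from typing import List, Sequence


def mutation_features(original: Sequence[int],
                      mutant_preds: Sequence[Sequence[int]]) -> List[List[int]]:
    """Build one binary vector per test input (node).

    Entry j is 1 if mutant j predicts a different label than the
    original model for that node (the node "kills" mutant j), else 0.
    """
    return [[int(preds[i] != original[i]) for preds in mutant_preds]
            for i in range(len(original))]


def kill_rate_scores(features: Sequence[Sequence[int]]) -> List[float]:
    """Stand-in misclassification score: fraction of mutants killed.

    Assumption for illustration only; NodeRank instead feeds the
    feature vectors to an ensemble of learned ranking models.
    """
    return [sum(row) / len(row) for row in features]


def prioritize(scores: Sequence[float]) -> List[int]:
    """Return node indices sorted by score, highest first (ties keep input order)."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])


def apfd(order: Sequence[int], misclassified: Sequence[bool]) -> float:
    """Average Percentage of Fault Detection for a prioritized order."""
    n = len(order)
    # 1-based positions, in the prioritized order, of the misclassified inputs.
    positions = [rank for rank, idx in enumerate(order, start=1)
                 if misclassified[idx]]
    k = len(positions)
    if n == 0 or k == 0:
        raise ValueError("APFD needs at least one test input and one misclassified input")
    return 1 - sum(positions) / (k * n) + 1 / (2 * n)


# Toy data: 4 nodes, 3 mutants (hypothetical predictions).
original = [0, 1, 2, 1]
mutants = [[0, 1, 0, 1],
           [0, 2, 0, 1],
           [0, 1, 1, 1]]
order = prioritize(kill_rate_scores(mutation_features(original, mutants)))
score = apfd(order, misclassified=[False, True, True, False])
print(order, score)  # [2, 1, 0, 3] 0.75
```

In this toy run the two misclassified nodes are ranked first, so a tester who labels inputs in the returned order finds both after two labels. A higher APFD means misclassified inputs appear earlier in the ranking.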

Source journal

IEEE Transactions on Software Engineering (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore

9.70

Self-citation rate

10.80%

Articles published

724

Review time

6 months

Journal description: IEEE Transactions on Software Engineering seeks contributions comprising well-defined theoretical results and empirical studies with potential impacts on software construction, analysis, or management. The scope of this Transactions extends from fundamental mechanisms to the development of principles and their application in specific environments. Specific topic areas include: a) Development and maintenance methods and models: Techniques and principles for specifying, designing, and implementing software systems, encompassing notations and process models. b) Assessment methods: Software tests, validation, reliability models, test and diagnosis procedures, software redundancy, design for error control, and measurements and evaluation of process and product aspects. c) Software project management: Productivity factors, cost models, schedule and organizational issues, and standards. d) Tools and environments: Specific tools, integrated tool environments, associated architectures, databases, and parallel and distributed processing issues. e) System issues: Hardware-software trade-offs. f) State-of-the-art surveys: Syntheses and comprehensive reviews of the historical development within specific areas of interest.