TEASMA: A Practical Methodology for Test Adequacy Assessment of Deep Neural Networks

IF 5.6 | Region 1: Computer Science | JCR Q1: Computer Science, Software Engineering | IEEE Transactions on Software Engineering | Pub Date: 2024-10-17 | DOI: 10.1109/TSE.2024.3482984

Amin Abbasishahkoo, Mahboubeh Dadkhah, Lionel Briand, Dayi Lin

IEEE Transactions on Software Engineering, vol. 50, no. 12, pp. 3307-3329. https://ieeexplore.ieee.org/document/10720834/

Citations: 0

Abstract

Successful deployment of Deep Neural Networks (DNNs), particularly in safety-critical systems, requires their validation with an adequate test set to ensure a sufficient degree of confidence in test outcomes. Although well-established test adequacy assessment techniques from traditional software, such as mutation analysis and coverage criteria, have been adapted to DNNs in recent years, we still need to investigate their application within a comprehensive methodology for accurately predicting the fault detection ability of test sets and thus assessing their adequacy. In this paper, we propose and evaluate TEASMA, a comprehensive and practical methodology designed to accurately assess the adequacy of test sets for DNNs. In practice, TEASMA allows engineers to decide whether they can trust high-accuracy test results and thus validate the DNN before its deployment. Based on a DNN model's training set, TEASMA provides a procedure to build accurate DNN-specific prediction models of the Fault Detection Rate (FDR) of a test set using an existing adequacy metric, thus enabling its assessment. We evaluated TEASMA with four state-of-the-art test adequacy metrics: Distance-based Surprise Coverage (DSC), Likelihood-based Surprise Coverage (LSC), Input Distribution Coverage (IDC), and Mutation Score (MS). We calculated MS based on mutation operators that directly modify the trained DNN model (i.e., post-training operators) due to their significant computational advantage compared to the operators that modify the DNN's training set or program (i.e., pre-training operators). Our extensive empirical evaluation, conducted across multiple DNN models and input sets, including large input sets such as ImageNet, reveals a strong linear correlation between the predicted and actual FDR values derived from MS, DSC, and IDC, with minimum $R^{2}$ values of 0.94 for MS and 0.90 for DSC and IDC. Furthermore, a low average Root Mean Square Error (RMSE) of 9% between actual and predicted FDR values across all subjects, when relying on regression analysis and MS, demonstrates the latter's superior accuracy when compared to DSC and IDC, with RMSE values of 0.17 and 0.18, respectively. Overall, these results suggest that TEASMA provides a reliable basis for confidently deciding whether to trust test results for DNN models.
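To make the evaluation measures in the abstract concrete, the following is a minimal sketch of the general idea only: fit a regression model that predicts FDR from an adequacy metric on subsets drawn from the training set, then compare predicted and actual FDR on test subsets using $R^{2}$ and RMSE. It is not the authors' implementation. The simple linear model, the function names, and all numeric values are hypothetical placeholders chosen for illustration; the abstract does not state which regression form TEASMA uses or how subsets are sampled.

```python
import math
from typing import Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean square error between actual and predicted values."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


# Hypothetical data: (mutation score, FDR) pairs measured on subsets
# sampled from the training set. These numbers are made up.
train_ms = [0.10, 0.30, 0.50, 0.70, 0.90]
train_fdr = [0.15, 0.32, 0.55, 0.68, 0.90]

# Build the DNN-specific prediction model (here: a straight line).
slope, intercept = fit_line(train_ms, train_fdr)

# Hypothetical test subsets: predict FDR from MS, then compare with
# the FDR actually observed on those subsets.
test_ms = [0.20, 0.40, 0.60, 0.80]
test_fdr = [0.22, 0.45, 0.58, 0.81]
predicted_fdr = [slope * m + intercept for m in test_ms]

print("R^2 :", round(r_squared(test_fdr, predicted_fdr), 3))
print("RMSE:", round(rmse(test_fdr, predicted_fdr), 3))
```

With FDR expressed on a 0 to 1 scale, an RMSE of 0.09 would correspond to the 9% figure quoted for MS, which puts it on the same footing as the 0.17 and 0.18 values reported for DSC and IDC; this reading is an interpretation of the abstract, not something it states explicitly.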




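Source journal

IEEE Transactions on Software Engineering (Engineering & Technology - Engineering: Electrical & Electronic)

CiteScore: 9.70

Self-citation rate: 10.80%

Articles published: 724

Review time: 6 months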

About the journal: IEEE Transactions on Software Engineering seeks contributions comprising well-defined theoretical results and empirical studies with potential impacts on software construction, analysis, or management. The scope of this Transactions extends from fundamental mechanisms to the development of principles and their application in specific environments. Specific topic areas include:

a) Development and maintenance methods and models: Techniques and principles for specifying, designing, and implementing software systems, encompassing notations and process models.
b) Assessment methods: Software tests, validation, reliability models, test and diagnosis procedures, software redundancy, design for error control, and measurements and evaluation of process and product aspects.
c) Software project management: Productivity factors, cost models, schedule and organizational issues, and standards.
d) Tools and environments: Specific tools, integrated tool environments, associated architectures, databases, and parallel and distributed processing issues.
e) System issues: Hardware-software trade-offs.
f) State-of-the-art surveys: Syntheses and comprehensive reviews of the historical development within specific areas of interest.