Dynamic robustness evaluation for automated model selection in operation

Information and Software Technology (IF 4.3; CAS Zone 2, Computer Science; JCR Q2, Computer Science, Information Systems). Pub Date: 2024-10-21. DOI: 10.1016/j.infsof.2024.107603

Jin Zhang, Jingyue Li, Zhirong Yang

Information and Software Technology, volume 178, Article 107603. Full text: https://www.sciencedirect.com/science/article/pii/S0950584924002088


Abstract

Context:

The increasing use of artificial neural network (ANN) classifiers in systems, especially safety-critical systems (SCSs), requires ensuring their robustness against out-of-distribution (OOD) shifts in operation, that is, changes in the underlying data distribution relative to the data used to train the classifier. However, measuring the robustness of classifiers in operation with only unlabeled data is challenging. Additionally, machine learning engineers may need to compare different models, or versions of the same model, and switch to the most robust one.

Objective:

This paper explores the problem of dynamic robustness evaluation for automated model selection. We aim to find efficient and effective metrics for evaluating and comparing the robustness of multiple ANN classifiers using unlabeled operational data.

Methods:

To quantitatively measure the differences between the model outputs and assess robustness under OOD shifts using unlabeled data, we choose distance-based metrics. An empirical comparison of five such metrics, suitable for higher-dimensional data like images, is performed. The selected metrics include Wasserstein distance (WD), maximum mean discrepancy (MMD), Hellinger distance (HL), Kolmogorov–Smirnov statistic (KS), and Kullback–Leibler divergence (KL), known for their efficacy in quantifying distribution differences. We evaluate these metrics on 20 state-of-the-art models (ten CIFAR10-based models, five CIFAR100-based models, and five ImageNet-based models) from a widely used robustness benchmark (RobustBench) using data perturbed with various types and magnitudes of corruptions to mimic real-world OOD shifts.
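To make the five metrics concrete, the following is a minimal NumPy sketch of how each could be computed between two one-dimensional samples of model outputs. It is not the authors' implementation: the choice of a per-sample confidence score as the compared quantity, the 20-bin histograms, the Gaussian kernel width `sigma`, and the synthetic Beta-distributed data are all assumptions made for illustration.

```python
import numpy as np

def wasserstein_1d(a, b):
    """1-D Wasserstein-1 distance for equal-sized samples: mean gap between sorted values."""
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

def _histograms(a, b, bins=20, eps=1e-12):
    """Shared-bin histograms over [0, 1], smoothed by eps and normalised to sum to 1."""
    p, _ = np.histogram(a, bins=bins, range=(0.0, 1.0))
    q, _ = np.histogram(b, bins=bins, range=(0.0, 1.0))
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return p, q

def hellinger(a, b):
    """Hellinger distance between the two histograms (0 = identical, 1 = disjoint)."""
    p, q = _histograms(a, b)
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

def kl_divergence(a, b):
    """Kullback-Leibler divergence KL(p || q) between the two histograms."""
    p, q = _histograms(a, b)
    return float(np.sum(p * np.log(p / q)))

def mmd_rbf(a, b, sigma=0.1):
    """Biased estimate of squared MMD with a Gaussian kernel (sigma is an assumed width)."""
    a, b = a[:, None], b[:, None]
    kernel = lambda x, y: np.exp(-((x - y.T) ** 2) / (2 * sigma ** 2))
    return float(kernel(a, a).mean() + kernel(b, b).mean() - 2 * kernel(a, b).mean())

# Synthetic stand-ins for per-sample confidence scores (assumption, not paper data).
rng = np.random.default_rng(0)
clean = rng.beta(8, 2, size=500)     # outputs on in-distribution data
shifted = rng.beta(4, 3, size=500)   # outputs on corrupted (OOD) data

metrics = {"WD": wasserstein_1d, "MMD": mmd_rbf, "HL": hellinger,
           "KS": ks_statistic, "KL": kl_divergence}
scores = {name: fn(clean, shifted) for name, fn in metrics.items()}
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

None of these computations needs labels, which is what makes this family of metrics usable on unlabeled operational data. Note that KL divergence is asymmetric, so the order of the two samples matters for that metric only.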

Results:

Our findings reveal that the WD metric outperforms others when ranking multiple ANN models for CIFAR10- and CIFAR100-based models, while the KS metric demonstrates superior performance for ImageNet-based models. MMD can be used as a reliable second option for both datasets.
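As a rough illustration of what ranking models with a distance metric might look like, the sketch below picks the model with the smallest distance score and checks how well the distance-based ranking agrees with a label-based ranking. The abstract does not say how ranking quality was measured, so the use of Spearman rank correlation, the assumption that a smaller distance indicates a more robust model, the model names, and all numbers are hypothetical.

```python
def ranks(values):
    """Rank positions (1 = smallest value); assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result

def spearman(x, y):
    """Spearman rank correlation for tie-free data."""
    n = len(x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Made-up distance scores between clean and operational outputs (not paper results).
distance = {"model_a": 0.12, "model_b": 0.31, "model_c": 0.07, "model_d": 0.22}

# Made-up accuracy drops under shift; these need labels, so they would only be
# available offline when validating the metric, not in operation.
accuracy_drop = {"model_a": 0.05, "model_b": 0.19, "model_c": 0.08, "model_d": 0.11}

# Automated selection: assume the smallest distance marks the most robust model.
selected = min(distance, key=distance.get)

names = list(distance)
rho = spearman([distance[n] for n in names], [accuracy_drop[n] for n in names])
print(f"selected: {selected}, rank agreement (Spearman): {rho:.2f}")
```

A correlation near 1 would mean the label-free ranking closely tracks the label-based one, which is the property a metric needs before it can drive model switching on its own.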

Conclusion:

This study highlights the effectiveness of distance-based metrics in ranking models’ robustness for automated model selection. It also emphasizes the significance of advancing research in dynamic robustness evaluation.


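Source journal: Information and Software Technology (Engineering & Technology; Computer Science: Software Engineering)

CiteScore: 9.10

Self-citation rate: 7.70%

Articles published: 164

Review time: 9.6 weeks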

About the journal: Information and Software Technology is the international archival journal focusing on research and experience that contributes to the improvement of software development practices. The journal's scope includes methods and techniques to better engineer software and manage its development. Articles submitted for review should have a clear component of software engineering or address ways to improve the engineering and management of software development. Areas covered by the journal include:
• Software management, quality and metrics
• Software processes
• Software architecture, modelling, specification, design and programming
• Functional and non-functional software requirements
• Software testing and verification & validation
• Empirical studies of all aspects of engineering and managing software development
Short Communications is a new section dedicated to short papers addressing new ideas, controversial opinions, "negative" results and much more. The journal encourages and welcomes submissions of systematic literature studies (reviews and maps) within the scope of the journal. Information and Software Technology is the premier outlet for systematic literature studies in software engineering.
