Alireza Aghabagherloo, Rafa Gálvez, D. Preuveneers, B. Preneel
{"title":"On the Brittleness of Robust Features: An Exploratory Analysis of Model Robustness and Illusionary Robust Features","authors":"Alireza Aghabagherloo, Rafa Gálvez, D. Preuveneers, B. Preneel","doi":"10.1109/SPW59333.2023.00009","DOIUrl":null,"url":null,"abstract":"Neural networks have been shown to be vulnerable to visual data perturbations imperceptible to the human eye. Nowadays, the leading hypothesis about the reason for the existence of these adversarial examples is the presence of non-robust features, which are highly predictive but brittle. Also, it has been shown that there exist two types of non-robust features depending on whether or not they are entangled with robust features; perturbing non-robust features entangled with robust features can form adversarial examples. This paper extends earlier work by showing that models trained exclusively on robust features are still vulnerable to one type of adversarial example. Standard-trained networks can classify more accurately than robustly trained networks in this situation. Our experiments show that this phenomenon is due to the high correlation between most of the robust features and both correct and incorrect labels. In this work, we define features highly correlated with correct and incorrect labels as illusionary robust features. We discuss how perturbing an image attacking robust models affects the feature space. Based on our observations on the feature space, we explain why standard models are more successful in correctly classifying these perturbed images than robustly trained models. Our observations also show that, similar to changing non-robust features, changing some of the robust features is still imperceptible to the human eye.","PeriodicalId":308378,"journal":{"name":"2023 IEEE Security and Privacy Workshops (SPW)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE Security and Privacy Workshops (SPW)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SPW59333.2023.00009","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
Neural networks have been shown to be vulnerable to visual data perturbations imperceptible to the human eye. Nowadays, the leading hypothesis about the reason for the existence of these adversarial examples is the presence of non-robust features, which are highly predictive but brittle. Also, it has been shown that there exist two types of non-robust features depending on whether or not they are entangled with robust features; perturbing non-robust features entangled with robust features can form adversarial examples. This paper extends earlier work by showing that models trained exclusively on robust features are still vulnerable to one type of adversarial example. Standard-trained networks can classify more accurately than robustly trained networks in this situation. Our experiments show that this phenomenon is due to the high correlation between most of the robust features and both correct and incorrect labels. In this work, we define features highly correlated with correct and incorrect labels as illusionary robust features. We discuss how perturbing an image attacking robust models affects the feature space. Based on our observations on the feature space, we explain why standard models are more successful in correctly classifying these perturbed images than robustly trained models. Our observations also show that, similar to changing non-robust features, changing some of the robust features is still imperceptible to the human eye.