Cats Are Not Fish: Deep Learning Testing Calls for Out-Of-Distribution Awareness

David Berend, Xiaofei Xie, L. Ma, Lingjun Zhou, Yang Liu, Chi Xu, Jianjun Zhao
{"title":"猫不是鱼:深度学习测试需要超出分布的意识","authors":"David Berend, Xiaofei Xie, L. Ma, Lingjun Zhou, Yang Liu, Chi Xu, Jianjun Zhao","doi":"10.1145/3324884.3416609","DOIUrl":null,"url":null,"abstract":"As Deep Learning (DL) is continuously adopted in many industrial applications, its quality and reliability start to raise concerns. Similar to the traditional software development process, testing the DL software to uncover its defects at an early stage is an effective way to reduce risks after deployment. According to the fundamental assumption of deep learning, the DL software does not provide statistical guarantee and has limited capability in handling data that falls outside of its learned distribution, i.e., out-of-distribution (OOD) data. Although recent progress has been made in designing novel testing techniques for DL software, which can detect thousands of errors, the current state-of-the-art DL testing techniques usually do not take the distribution of generated test data into consideration. It is therefore hard to judge whether the “identified errors” are indeed meaningful errors to the DL application (i.e., due to quality issues of the model) or outliers that cannot be handled by the current model (i.e., due to the lack of training data). Tofill this gap, we take thefi rst step and conduct a large scale empirical study, with a total of 451 experiment configurations, 42 deep neural networks (DNNs) and 1.2 million test data instances, to investigate and characterize the impact of OOD-awareness on DL testing. We further analyze the consequences when DL systems go into production by evaluating the effectiveness of adversarial retraining with distribution-aware errors. The results confirm that introducing data distribution awareness in both testing and enhancement phases outperforms distribution unaware retraining by up to 21.5%.","PeriodicalId":106337,"journal":{"name":"2020 35th IEEE/ACM International Conference on Automated Software Engineering (ASE)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"52","resultStr":"{\"title\":\"Cats Are Not Fish: Deep Learning Testing Calls for Out-Of-Distribution Awareness\",\"authors\":\"David Berend, Xiaofei Xie, L. Ma, Lingjun Zhou, Yang Liu, Chi Xu, Jianjun Zhao\",\"doi\":\"10.1145/3324884.3416609\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"As Deep Learning (DL) is continuously adopted in many industrial applications, its quality and reliability start to raise concerns. Similar to the traditional software development process, testing the DL software to uncover its defects at an early stage is an effective way to reduce risks after deployment. According to the fundamental assumption of deep learning, the DL software does not provide statistical guarantee and has limited capability in handling data that falls outside of its learned distribution, i.e., out-of-distribution (OOD) data. Although recent progress has been made in designing novel testing techniques for DL software, which can detect thousands of errors, the current state-of-the-art DL testing techniques usually do not take the distribution of generated test data into consideration. It is therefore hard to judge whether the “identified errors” are indeed meaningful errors to the DL application (i.e., due to quality issues of the model) or outliers that cannot be handled by the current model (i.e., due to the lack of training data). 
Tofill this gap, we take thefi rst step and conduct a large scale empirical study, with a total of 451 experiment configurations, 42 deep neural networks (DNNs) and 1.2 million test data instances, to investigate and characterize the impact of OOD-awareness on DL testing. We further analyze the consequences when DL systems go into production by evaluating the effectiveness of adversarial retraining with distribution-aware errors. The results confirm that introducing data distribution awareness in both testing and enhancement phases outperforms distribution unaware retraining by up to 21.5%.\",\"PeriodicalId\":106337,\"journal\":{\"name\":\"2020 35th IEEE/ACM International Conference on Automated Software Engineering (ASE)\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-09-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"52\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2020 35th IEEE/ACM International Conference on Automated Software Engineering (ASE)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3324884.3416609\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 35th IEEE/ACM International Conference on Automated Software Engineering (ASE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3324884.3416609","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 52
Abstract
As Deep Learning (DL) is increasingly adopted in industrial applications, its quality and reliability have started to raise concerns. As in traditional software development, testing DL software to uncover defects at an early stage is an effective way to reduce risks after deployment. According to the fundamental assumption of deep learning, DL software provides no statistical guarantees and has limited capability in handling data that falls outside of its learned distribution, i.e., out-of-distribution (OOD) data. Although recent progress has been made in designing novel testing techniques for DL software that can detect thousands of errors, the current state-of-the-art DL testing techniques usually do not take the distribution of generated test data into consideration. It is therefore hard to judge whether the “identified errors” are indeed meaningful errors to the DL application (i.e., due to quality issues of the model) or outliers that cannot be handled by the current model (i.e., due to the lack of training data). To fill this gap, we take the first step and conduct a large-scale empirical study, with a total of 451 experiment configurations, 42 deep neural networks (DNNs) and 1.2 million test data instances, to investigate and characterize the impact of OOD awareness on DL testing. We further analyze the consequences when DL systems go into production by evaluating the effectiveness of adversarial retraining with distribution-aware errors. The results confirm that introducing data distribution awareness in both testing and enhancement phases outperforms distribution-unaware retraining by up to 21.5%.
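The abstract does not specify how distribution awareness is operationalized, so the sketch below is only an illustration of the core idea, not the paper's pipeline: it uses the max-softmax-probability score, a common baseline OOD detector, to partition misclassified test inputs into in-distribution errors and likely OOD outliers. All names, the threshold value, and the toy data are hypothetical; plain NumPy is used to keep it self-contained.

```python
# Illustrative sketch (assumed detector, not the paper's exact method):
# separate "meaningful" in-distribution errors from OOD outliers among
# failing test inputs, using max softmax probability as the OOD score.
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def split_errors_by_distribution(logits, predictions, labels, threshold=0.9):
    """Partition misclassified inputs by a confidence-based OOD score.

    Errors whose max softmax probability is below `threshold` (a
    hypothetical cutoff) are treated as OOD outliers; the rest are
    counted as in-distribution errors worth reporting and retraining on.
    """
    ood_score = softmax(logits).max(axis=1)  # confidence as in-distribution proxy
    is_error = predictions != labels
    in_dist_errors = is_error & (ood_score >= threshold)
    ood_outliers = is_error & (ood_score < threshold)
    return in_dist_errors, ood_outliers

# Toy example: 4 generated test inputs, 3 classes.
logits = np.array([
    [4.0, 0.5, 0.1],   # confident prediction
    [0.6, 0.5, 0.4],   # near-uniform logits -> low confidence
    [3.0, 2.9, 0.1],   # two competing classes -> low confidence
    [5.0, 0.1, 0.2],   # confident prediction
])
preds = logits.argmax(axis=1)
labels = np.array([1, 0, 1, 0])  # inputs 0 and 2 are misclassified

in_dist, ood = split_errors_by_distribution(logits, preds, labels)
print("in-distribution errors:", np.flatnonzero(in_dist))  # [0]
print("OOD outliers:         ", np.flatnonzero(ood))       # [2]
```

Under this reading, the same partition would also gate the enhancement phase: retraining on the in-distribution errors rather than on all failures is, plausibly, what the distribution-aware retraining that outperforms the unaware variant by up to 21.5% amounts to.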