Comparing Input Prioritization Techniques for Testing Deep Learning Algorithms
V. Mosin, M. Staron, Darko Durisic, F. D. O. Neto, Sushant Kumar Pandey, Ashok Chaitanya Koppisetty
2022 48th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), August 2022
DOI: 10.1109/SEAA56994.2022.00020
Citations: 1
Abstract
Deep learning (DL) systems are becoming an essential part of software systems, so they must be tested thoroughly. This is challenging because test sets grow over time as new data is acquired, making testing increasingly time-consuming. Input prioritization reduces testing time, since prioritized test inputs are more likely to reveal erroneous behavior of a DL system earlier during test execution. Because input prioritization approaches have so far been compared against each other only superficially, this study compares different input prioritization techniques with respect to their effectiveness and efficiency. This work considers surprise-adequacy-based, autoencoder-based, and similarity-based input prioritization approaches, using the example of testing DL image classification algorithms on the MNIST, Fashion-MNIST, CIFAR-10, and STL-10 datasets. To measure effectiveness and efficiency, we use a modified APFD (Average Percentage of Faults Detected) and setup and execution time, respectively. We observe that surprise adequacy is the most effective (0.785 to 0.914 APFD). The autoencoder-based and similarity-based techniques are less effective, with performance ranging from 0.532 to 0.744 APFD and from 0.579 to 0.709 APFD, respectively. In contrast, the similarity-based and surprise-adequacy-based approaches are the most and least efficient, respectively. The findings demonstrate the trade-offs between the considered input prioritization techniques and help to understand their practical applicability for testing DL algorithms.
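For context on the evaluation metric: the abstract refers to a modified APFD, but the specific modification is not described here, so the sketch below implements the conventional APFD formula, APFD = 1 − (Σ TF_i)/(n·m) + 1/(2n), where n is the number of test inputs, m the number of faults, and TF_i the rank of the first input revealing fault i. Function and variable names are illustrative, not taken from the paper.

```python
def apfd(first_fail_positions, n_tests):
    """Standard APFD (Average Percentage of Faults Detected).

    first_fail_positions: 1-based ranks, in the prioritized ordering,
        of the first test input that reveals each fault (one entry
        per fault, so its length is m).
    n_tests: total number of test inputs in the ordering (n).

    Returns a value in (0, 1); higher means faults are revealed
    earlier by the prioritization.
    """
    m = len(first_fail_positions)
    return 1.0 - sum(first_fail_positions) / (n_tests * m) + 1.0 / (2 * n_tests)

# Hypothetical example: 100 prioritized inputs; the three faults are
# first revealed by the inputs at ranks 3, 10, and 25.
print(round(apfd([3, 10, 25], 100), 3))  # 0.878
```

A prioritization technique that pushes fault-revealing inputs toward the front of the ordering lowers the TF_i ranks and thus raises the APFD, which is why the metric serves as the effectiveness measure in this comparison.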