Li Li; Chuanqi Tao; Hongjing Guo; Jingxuan Zhang; Xiaobing Sun
Title: FATS: Feature Distribution Analysis-Based Test Selection for Deep Learning Enhancement
Journal: IEEE Transactions on Big Data, vol. 10, no. 2, pp. 132-145
DOI: 10.1109/TBDATA.2023.3334648
Published: 2023-11-20
Impact factor: 7.5 (JCR Q1, Computer Science, Information Systems)
URL: https://ieeexplore.ieee.org/document/10323141/
Citation count: 0
Abstract
Deep learning has been applied to many applications across different domains. However, distribution shift between test data and training data is a major factor affecting the quality of deep neural networks (DNNs). To address this issue, existing research mainly focuses on enhancing DNN models by retraining them with labeled test data. However, labeling test data is costly, which seriously reduces the efficiency of DNN testing. To solve this problem, test selection strategically selects a small set of tests to label. Unfortunately, existing test selection methods seldom consider data distribution shift. To address this gap, this paper proposes an approach for test selection named Feature Distribution Analysis-Based Test Selection (FATS). FATS analyzes the distributions of test data and training data and then adopts learning to rank (a kind of supervised machine learning for ranking tasks) to intelligently combine the analysis results for test selection. We conduct an empirical study on popular datasets and DNN models and compare FATS with seven test selection methods. Experimental results show that FATS effectively alleviates the impact of distribution shifts and outperforms the compared methods, with an average accuracy improvement of 19.6%$\sim$69.7% for DNN model enhancement.
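The general idea of shift-aware test selection described in the abstract can be illustrated with a small sketch. This is not the FATS algorithm: the abstract does not specify which distribution features are analyzed or how the ranker is trained. The sketch assumes two hypothetical signals (a z-score distance from the training feature distribution and a softmax-based uncertainty) and replaces the learned ranking model with fixed, hand-picked weights. All function names and weights are illustrative assumptions.

```python
import numpy as np


def shift_scores(train_feats, test_feats, eps=1e-8):
    """Mean absolute z-score of each test input relative to the training features.

    Assumption: a simple per-feature mean/std summary stands in for the
    distribution analysis; the paper's actual analysis may differ.
    """
    mean = train_feats.mean(axis=0)
    std = train_feats.std(axis=0) + eps  # eps avoids division by zero
    return np.abs((test_feats - mean) / std).mean(axis=1)


def uncertainty_scores(probs):
    """One minus the top predicted probability for each test input."""
    return 1.0 - probs.max(axis=1)


def min_max(x):
    """Scale scores to [0, 1]; returns zeros when all scores are equal."""
    span = x.max() - x.min()
    if span == 0:
        return np.zeros_like(x, dtype=float)
    return (x - x.min()) / span


def select_tests(train_feats, test_feats, probs, k, w_shift=0.7, w_unc=0.3):
    """Rank test inputs by a weighted sum of normalized scores, return top-k indices.

    Assumption: the fixed weights are a stand-in for a learning-to-rank model,
    which would learn how to combine the signals from labeled examples.
    """
    combined = (w_shift * min_max(shift_scores(train_feats, test_feats))
                + w_unc * min_max(uncertainty_scores(probs)))
    return np.argsort(-combined)[:k]


# Toy data: two test inputs lie far from the training feature distribution.
train_feats = np.array([[0., 0.], [1., 1.], [2., 2.], [-1., -1.], [-2., -2.]])
test_feats = np.array([[0., 0.], [10., 10.], [0.5, -0.5], [-8., 9.]])
probs = np.array([[0.9, 0.1], [0.6, 0.4], [0.5, 0.5], [0.99, 0.01]])
selected = select_tests(train_feats, test_feats, probs, k=2)
print(selected)
```

In this toy setup, the selected inputs would be the ones sent for labeling and then used to retrain the model; a learned ranker would replace the fixed weights with a combination fitted to data.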
About the journal:
The IEEE Transactions on Big Data publishes peer-reviewed articles focusing on big data. These articles present innovative research ideas and application results across disciplines, including novel theories, algorithms, and applications. Research areas cover a wide range, such as big data analytics, visualization, curation, management, semantics, infrastructure, standards, performance analysis, intelligence extraction, scientific discovery, security, privacy, and legal issues specific to big data. The journal also prioritizes applications of big data in fields generating massive datasets.