确定基于深度学习的视频分类器测试用例的优先级

IF 3.5 2区计算机科学 Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING Empirical Software Engineering Pub Date : 2024-07-22 DOI:10.1007/s10664-024-10520-1

Yinghua Li, Xueqi Dang, Lei Ma, Jacques Klein, Tegawendé F. Bissyandé

{"title":"确定基于深度学习的视频分类器测试用例的优先级","authors":"Yinghua Li, Xueqi Dang, Lei Ma, Jacques Klein, Tegawendé F. Bissyandé","doi":"10.1007/s10664-024-10520-1","DOIUrl":null,"url":null,"abstract":"The widespread adoption of video-based applications across various fields highlights their importance in modern software systems. However, in comparison to images or text, labelling video test cases for the purpose of assessing system accuracy can lead to increased expenses due to their temporal structure and larger volume. Test prioritization has emerged as a promising approach to mitigate the labeling cost, which prioritizes potentially misclassified test inputs so that such inputs can be identified earlier with limited time and manual labeling efforts. However, applying existing prioritization techniques to video test cases faces certain limitations: they do not account for the unique temporal information present in video data. Unlike static image datasets that only contain spatial information, video inputs consist of multiple frames that capture the dynamic changes of objects over time. In this paper, we propose VRank, the first test prioritization approach designed specifically for video test inputs. The fundamental idea behind VRank is that video-type tests with a higher probability of being misclassified by the evaluated DNN classifier are considered more likely to reveal faults and will be prioritized higher. To this end, we train a ranking model with the aim of predicting the probability of a given test input being misclassified by a DNN classifier. This prediction relies on four types of generated features: temporal features (TF), video embedding features (EF), prediction features (PF), and uncertainty features (UF). We rank all test inputs in the target test set based on their misclassification probabilities. Videos with a higher likelihood of being misclassified will be prioritized higher. We conducted an empirical evaluation to assess the performance of VRank, involving 120 subjects with both natural and noisy datasets. The experimental results reveal VRank outperforms all compared test prioritization methods, with an average improvement of 5.76%\\(\\sim \\)46.51% on natural datasets and 4.26%\\(\\sim \\)53.56% on noisy datasets.","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"45 1","pages":""},"PeriodicalIF":3.5000,"publicationDate":"2024-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Prioritizing test cases for deep learning-based video classifiers\",\"authors\":\"Yinghua Li, Xueqi Dang, Lei Ma, Jacques Klein, Tegawendé F. Bissyandé\",\"doi\":\"10.1007/s10664-024-10520-1\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The widespread adoption of video-based applications across various fields highlights their importance in modern software systems. However, in comparison to images or text, labelling video test cases for the purpose of assessing system accuracy can lead to increased expenses due to their temporal structure and larger volume. Test prioritization has emerged as a promising approach to mitigate the labeling cost, which prioritizes potentially misclassified test inputs so that such inputs can be identified earlier with limited time and manual labeling efforts. However, applying existing prioritization techniques to video test cases faces certain limitations: they do not account for the unique temporal information present in video data. Unlike static image datasets that only contain spatial information, video inputs consist of multiple frames that capture the dynamic changes of objects over time. In this paper, we propose VRank, the first test prioritization approach designed specifically for video test inputs. The fundamental idea behind VRank is that video-type tests with a higher probability of being misclassified by the evaluated DNN classifier are considered more likely to reveal faults and will be prioritized higher. To this end, we train a ranking model with the aim of predicting the probability of a given test input being misclassified by a DNN classifier. This prediction relies on four types of generated features: temporal features (TF), video embedding features (EF), prediction features (PF), and uncertainty features (UF). We rank all test inputs in the target test set based on their misclassification probabilities. Videos with a higher likelihood of being misclassified will be prioritized higher. We conducted an empirical evaluation to assess the performance of VRank, involving 120 subjects with both natural and noisy datasets. The experimental results reveal VRank outperforms all compared test prioritization methods, with an average improvement of 5.76%\\\\(\\\\sim \\\\)46.51% on natural datasets and 4.26%\\\\(\\\\sim \\\\)53.56% on noisy datasets.\",\"PeriodicalId\":11525,\"journal\":{\"name\":\"Empirical Software Engineering\",\"volume\":\"45 1\",\"pages\":\"\"},\"PeriodicalIF\":3.5000,\"publicationDate\":\"2024-07-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Empirical Software Engineering\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1007/s10664-024-10520-1\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, SOFTWARE ENGINEERING\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Empirical Software Engineering","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s10664-024-10520-1","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}

引用次数: 0

摘要

视频应用广泛应用于各个领域，凸显了其在现代软件系统中的重要性。然而，与图像或文本相比，为评估系统准确性而对视频测试用例进行标注，会因为其时间结构和较大的容量而导致费用增加。测试优先级排序已成为减轻标注成本的一种可行方法，它可对可能被错误分类的测试输入进行优先级排序，以便在有限的时间和人工标注工作中尽早识别出此类输入。然而，将现有的优先级排序技术应用于视频测试用例会面临一定的局限性：它们无法考虑视频数据中独特的时间信息。与只包含空间信息的静态图像数据集不同，视频输入由多个帧组成，可捕捉对象随时间发生的动态变化。在本文中，我们提出了 VRank，这是第一种专为视频测试输入而设计的测试优先级排序方法。VRank 背后的基本思想是，视频类型的测试如果被已评估的 DNN 分类器误分类的概率较高，则被认为更有可能暴露出故障，因此优先级会更高。为此，我们训练了一个排名模型，目的是预测给定测试输入被 DNN 分类器误分类的概率。这种预测依赖于四种类型的生成特征：时间特征 (TF)、视频嵌入特征 (EF)、预测特征 (PF) 和不确定性特征 (UF)。我们根据误分类概率对目标测试集中的所有测试输入进行排序。被误判可能性较高的视频将被优先处理。我们对 VRank 的性能进行了实证评估，共有 120 名受试者参加了自然数据集和噪声数据集的评估。实验结果表明，VRank优于所有比较过的测试优先级排序方法，在自然数据集上平均提高了5.76%（46.51%），在噪声数据集上平均提高了4.26%（53.56%）。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

摘要图片

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Prioritizing test cases for deep learning-based video classifiers

The widespread adoption of video-based applications across various fields highlights their importance in modern software systems. However, in comparison to images or text, labelling video test cases for the purpose of assessing system accuracy can lead to increased expenses due to their temporal structure and larger volume. Test prioritization has emerged as a promising approach to mitigate the labeling cost, which prioritizes potentially misclassified test inputs so that such inputs can be identified earlier with limited time and manual labeling efforts. However, applying existing prioritization techniques to video test cases faces certain limitations: they do not account for the unique temporal information present in video data. Unlike static image datasets that only contain spatial information, video inputs consist of multiple frames that capture the dynamic changes of objects over time. In this paper, we propose VRank, the first test prioritization approach designed specifically for video test inputs. The fundamental idea behind VRank is that video-type tests with a higher probability of being misclassified by the evaluated DNN classifier are considered more likely to reveal faults and will be prioritized higher. To this end, we train a ranking model with the aim of predicting the probability of a given test input being misclassified by a DNN classifier. This prediction relies on four types of generated features: temporal features (TF), video embedding features (EF), prediction features (PF), and uncertainty features (UF). We rank all test inputs in the target test set based on their misclassification probabilities. Videos with a higher likelihood of being misclassified will be prioritized higher. We conducted an empirical evaluation to assess the performance of VRank, involving 120 subjects with both natural and noisy datasets. The experimental results reveal VRank outperforms all compared test prioritization methods, with an average improvement of 5.76%\(\sim \)46.51% on natural datasets and 4.26%\(\sim \)53.56% on noisy datasets.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Empirical Software Engineering 工程技术-计算机：软件工程

CiteScore

8.50

自引率

12.20%

发文量

169

审稿时长

>12 weeks

期刊介绍： Empirical Software Engineering provides a forum for applied software engineering research with a strong empirical component, and a venue for publishing empirical results relevant to both researchers and practitioners. Empirical studies presented here usually involve the collection and analysis of data and experience that can be used to characterize, evaluate and reveal relationships between software development deliverables, practices, and technologies. Over time, it is expected that such empirical results will form a body of knowledge leading to widely accepted and well-formed theories. The journal also offers industrial experience reports detailing the application of software technologies - processes, methods, or tools - and their effectiveness in industrial settings. Empirical Software Engineering promotes the publication of industry-relevant research, to address the significant gap between research and practice.

期刊最新文献

The effect of data complexity on classifier performance. Reinforcement learning for online testing of autonomous driving systems: a replication and extension study. An empirical study on developers’ shared conversations with ChatGPT in GitHub pull requests and issues Quality issues in machine learning software systems An empirical study of token-based micro commits