An Automated and Scalable ML Solution for Mapping Invasive Species: the Case of the Australian Tree Fern in Hawaiian Forests
Pub Date: 2023-01-01 | DOI: 10.1109/WACVW58289.2023.00019
O. Iancu, Kara Yang, Han Man, Theresa Cabrera Menard
Biodiversity loss and ecosystem degradation are global challenges demanding creative and scalable solutions. Recent increases in data collection coupled with machine learning have the potential to expand landscape monitoring capabilities. We present a computer vision solution to the problem of identifying invasive species. The Australian Tree Fern (Cyathea cooperi) is a fast-growing species that is displacing slower-growing native plants across the Hawaiian islands. The Nature Conservancy has partnered with Amazon Web Services to develop and test an automated tree fern detection and mapping solution based on imagery collected from fixed-wing aircraft. We utilize deep learning to identify tree ferns and map their locations. Distinguishing between invasive and native tree ferns in aerial images is challenging for human experts. We explore techniques such as image embeddings and principal component analysis to assist in the classification. Creating quality training datasets is critical for developing ML solutions. We describe how semi-automated labeling tools can expedite this process. These steps are integrated into an automated cloud-native inference pipeline that reduces localization time from weeks to minutes. We further investigate issues encountered when the pipeline is applied to novel images and a decline in performance relative to the training data is observed. We trace the origin of the problem to a subset of images originating from steep mountain slopes and riverbanks, which generate blurring and streaking patterns mistakenly labeled as tree ferns. We propose a preprocessing step based on Haralick texture features that detects and flags images that differ from the training set. Experimental results show that the proposed method performs well and can potentially enhance model performance through iterative relabeling and retraining.
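As an illustration of how such a texture-based screening step might look, the sketch below computes GLCM texture statistics (a subset of Haralick's original features) with scikit-image and flags image tiles whose statistics deviate strongly from the training set. The GLCM parameters and the z-score threshold are illustrative assumptions, not the authors' configuration.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops
from skimage.util import img_as_ubyte

GLCM_PROPS = ["contrast", "dissimilarity", "homogeneity", "energy", "correlation", "ASM"]

def texture_descriptor(gray_image):
    """GLCM texture statistics (a subset of Haralick's original features)."""
    img = img_as_ubyte(gray_image)
    glcm = graycomatrix(img, distances=[1, 2], angles=[0, np.pi / 2],
                        levels=256, symmetric=True, normed=True)
    return np.hstack([graycoprops(glcm, p).ravel() for p in GLCM_PROPS])

def fit_reference(training_tiles):
    """Per-feature mean and std over tiles drawn from the training distribution."""
    feats = np.stack([texture_descriptor(t) for t in training_tiles])
    return feats.mean(axis=0), feats.std(axis=0) + 1e-8

def flag_out_of_distribution(tiles, ref_mean, ref_std, z_threshold=3.0):
    """Flag tiles whose texture statistics deviate strongly from the training set."""
    flags = []
    for t in tiles:
        z = np.abs((texture_descriptor(t) - ref_mean) / ref_std)
        flags.append(bool(z.max() > z_threshold))
    return flags
```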
{"title":"An Automated and Scalable ML Solution for Mapping Invasive Species: the Case of the Australian Tree Fern in Hawaiian Forests","authors":"O. Iancu, Kara Yang, Han Man, Theresa Cabrera Menard","doi":"10.1109/WACVW58289.2023.00019","DOIUrl":"https://doi.org/10.1109/WACVW58289.2023.00019","url":null,"abstract":"Biodiversity loss and ecosystem degradation are global challenges demanding creative and scalable solutions. Recent increases in data collection coupled with machine learning have the potential to expand landscape monitoring capabilities. We present a computer vision solution to the problem of identifying invasive species. The Australian Tree Fern (Cyathea cooperi) is a fast growing species that is displacing slower growing native plants across the Hawaiian islands. The Nature Conservancy organization has partnered with Amazon Web Services to develop and test an automated tree fern detection and mapping solution based on imagery collected from fixed wing aircraft. We utilize deep learning to identify tree ferns and map their locations. Distinguishing between invasive and native tree ferns in aerial images is challenging for human experts. We explore techniques such as image embeddings and principal component analysis to assist in the classification. Creating quality training datasets is critical for developing ML solutions. We describe how semi-automated labeling tools can expedite this process. These steps are integrated into an automated cloud native inference pipeline that reduces localization time from weeks to minutes. We further investigate issues encountered when the pipeline is utilized on novel images and a decline in performance relative to the training data is observed. We trace the origin of the problem to a subset of images originating from steep mountain slopes and riverbanks which generate blurring and streaking patterns mistakenly labeled as tree ferns. We propose a preprocessing step based on Haralick texture features which detects and flags images different from the training set. Experimental results show that the proposed method performs well and can potentially enhance the model performance by relabeling and retraining the model iteratively.","PeriodicalId":306545,"journal":{"name":"2023 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW)","volume":"186 ","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120868261","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Can Machines Learn to Map Creative Videos to Marketing Campaigns?
Pub Date: 2023-01-01 | DOI: 10.1109/WACVW58289.2023.00056
Jarod Wang, Chirag Mandaviya
The demand for accurate estimation of marketing's incremental effect is rapidly increasing to enable marketers to make informed decisions on their ad investment. The process of admapping links an ad shown to consumers on fixed marketing channels (Linear TV, Digital, Social) to a marketing creative video. An accurate admapping, which is a special case of video copy detection, is thus a cornerstone of ensuring that each ad exposure is linked to the correct creative and marketing campaign, and hence of precise marketing effect measurement. With each campaign having tens of creatives and each country (marketplace) running tens of marketing campaigns each week, the current process of human annotation of hundreds of creatives requires over 800 team hours annually. Moreover, this manual process causes significant challenges in onboarding new businesses and countries to measurement due to the absence of an intelligent, model-based admapping solution. To solve this problem, we built a machine learning (ML) model that leverages fingerprinting methodology and automatic language identification technology to match each creative to the marketing campaign. In the paper, we present the computing algorithm and implementation details with results from an actual campaign dataset. Extensive validation and comparison studies demonstrate improved mapping results with the proposed method, which achieves an 87% F1 score and 82% accuracy. To the best of our knowledge, this is the first model to use a fusion of visual, audio, language, and metadata features for such an ML-based content mapping solution. The proposed method leads to a 90% reduction in the time spent on admapping compared to the manual process.
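The fingerprinting idea can be illustrated with a minimal perceptual-hash sketch: each sampled frame is reduced to a difference hash, and an ad is mapped to the creative with the smallest average Hamming distance. The hashing scheme, sampling step, and matching rule below are assumptions for illustration only and omit the audio, language, and metadata features the paper fuses.

```python
import cv2
import numpy as np

def frame_dhash(gray_frame, hash_size=8):
    """Difference-hash fingerprint of one grayscale frame (boolean bit vector)."""
    small = cv2.resize(gray_frame, (hash_size + 1, hash_size))
    return (small[:, 1:] > small[:, :-1]).flatten()

def video_fingerprint(frames, step=10):
    """Hash every `step`-th frame of a clip (frames: list of grayscale arrays)."""
    return np.stack([frame_dhash(f) for f in frames[::step]])

def best_matching_creative(ad_frames, creative_fingerprints):
    """Map an ad clip to the creative with the smallest mean per-frame Hamming distance."""
    ad_fp = video_fingerprint(ad_frames)
    scores = {}
    for name, fp in creative_fingerprints.items():
        # For each ad hash, take the distance to the closest hash in the creative.
        dists = [min(np.count_nonzero(h != c) for c in fp) for h in ad_fp]
        scores[name] = float(np.mean(dists))
    return min(scores, key=scores.get)
```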
{"title":"Can Machines Learn to Map Creative Videos to Marketing Campaigns?","authors":"Jarod Wang, Chirag Mandaviya","doi":"10.1109/WACVW58289.2023.00056","DOIUrl":"https://doi.org/10.1109/WACVW58289.2023.00056","url":null,"abstract":"The demand for accurate estimation of marketing's incremental effect is rapidly increasing to enable marketers make informed decisions on their ad investment. The process of admapping links an ad shown to consumers on the fixed marketing channels (Linear TV, Digital, Social) to a marketing creative video. Thus, an accurate admapping, which is a special case of video copy detection, is a cornerstone of ensuring exposure of ad is linked to the correct creative and marketing campaign and hence precise marketing effect measurement. With each campaign having tens of creatives and each country (marketplace) having tens of marketing campaigns each week, the current process of human annotation of hundreds of creatives requires over 800+ team's hours annually. Moreover, this manual process causes significant challenges in onboarding new businesses and countries to measurement due to the absence of intelligent model based admapping solution. To solve this problem, we built a machine learning (ML) model that leverages fingerprinting methodology and automatic language identi-fication technology to match each creative to the marketing campaign. In the paper, we present the computing algorithm and implementation details with results from actual campaign dataset. Extensive validation and comparison studies conducted demonstrates improved mapping results with the new proposed method, achieving 87% F1 score and 82% accuracy. To our best knowledge, this is the first model that uses a fusion of visual, audio, language and metadata features for such ML based content mapping solution. The proposed method leads to 90% reduction on the time spent on admapping compared to manual solutions.","PeriodicalId":306545,"journal":{"name":"2023 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116517237","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Analyzing the Impact of Gender Misclassification on Face Recognition Accuracy
Pub Date: 2023-01-01 | DOI: 10.1109/WACVW58289.2023.00037
Afi Edem Edi Gbekevi, Paloma Vela Achu, Gabriella Pangelinan, M. King, K. Bowyer
Automated face recognition technologies have been under scrutiny in recent years due to noted variations in accuracy relative to race and gender. Much of this concern was driven by media coverage of high error rates for women and persons of color reported in an evaluation of commercial gender classification ("gender from face") tools. Many decried the conflation of errors observed in the task of gender classification with the task of face recognition. This motivated the question of whether images that are misclassified by a gender classification algorithm have an increased error rate with face recognition algorithms. In the first experiment, we analyze the False Match Rate (FMR) of face recognition for comparisons in which one or both of the images are gender-misclassified. In the second experiment, we examine match scores of gender-misclassified images when compared to images from their labeled versus classified gender. We find that, in general, gender-misclassified images are not associated with an increased FMR. For females, non-mated comparisons involving one misclassified image actually shift the resulting impostor distribution to lower similarity scores, representing improved accuracy. To our knowledge, this is the first work to analyze (1) the FMR of one- and two-misclassification error pairs and (2) non-mated match scores for misclassified images against labeled- and classified-gender categories.
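For readers unfamiliar with the metric, FMR at a fixed operating threshold is simply the fraction of non-mated (impostor) comparisons whose similarity score meets or exceeds that threshold. The snippet below illustrates the computation on placeholder score distributions; the scores and threshold are synthetic, not the paper's data.

```python
import numpy as np

def false_match_rate(impostor_scores, threshold):
    """FMR: fraction of non-mated (impostor) comparisons at or above the decision threshold."""
    scores = np.asarray(impostor_scores)
    return float(np.mean(scores >= threshold))

# Illustrative comparison at one operating threshold (scores are synthetic placeholders).
rng = np.random.default_rng(0)
all_impostor = rng.normal(0.30, 0.08, 100_000)      # all non-mated pairs
misclassified = rng.normal(0.27, 0.08, 5_000)       # pairs containing a gender-misclassified image
threshold = 0.55
print(false_match_rate(all_impostor, threshold), false_match_rate(misclassified, threshold))
```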
{"title":"Analyzing the Impact of Gender Misclassification on Face Recognition Accuracy","authors":"Afi Edem Edi Gbekevi, Paloma Vela Achu, Gabriella Pangelinan, M. King, K. Bowyer","doi":"10.1109/WACVW58289.2023.00037","DOIUrl":"https://doi.org/10.1109/WACVW58289.2023.00037","url":null,"abstract":"Automated face recognition technologies have been under scrutiny in recent years due to noted variations in accuracy relative to race and gender. Much of this concern was driven by media coverage of high error rates for women and persons of color reported in an evaluation of commercial gender classification ('gender from face”) tools. Many decried the conflation of errors observed in the task of gender classification with the task of face recognition. This motivated the question of whether images that are misclas-sified by a gender classification algorithm have increased error rate with face recognition algorithms. In the first experiment, we analyze the False Match Rate (FMR) of face recognition for comparisons in which one or both of the images are gender-misclassified. In the second experiment, we examine match scores of gender-misclassified images when compared to images from their labeled versus classified gender. We find that, in general, gender misclassified images are not associated with an increased FMR. For females, non-mated comparisons involving one misclassified image actually shift the resultant impostor distribution to lower similarity scores, representing improved accuracy. To our knowledge, this is the first work to analyze (1) the FMR of one- and two-misclassification error pairs and (2) non-mated match scores for misclassified images against labeled- and classified-gender categories.","PeriodicalId":306545,"journal":{"name":"2023 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129861444","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On the Importance of Spatio-Temporal Learning for Video Quality Assessment
Pub Date: 2023-01-01 | DOI: 10.1109/WACVW58289.2023.00053
Dario Fontanel, David Higham, Benoit Vallade
Video quality assessment (VQA) has attracted considerable interest in the computer vision community, as it plays a critical role in services that provide customers with high-quality video content. Due to the lack of high-quality reference videos and the difficulties in collecting subjective evaluations, assessing video quality is a challenging and still unsolved problem. Moreover, most public research efforts focus only on user-generated content (UGC), making it unclear whether reliable solutions can be adopted for assessing the quality of production-related videos. The goal of this work is to assess the importance of spatial and temporal learning for production-related VQA. In particular, it evaluates state-of-the-art UGC video quality assessment approaches on the LIVE-APV dataset, demonstrating the importance of learning contextual characteristics from each video frame as well as capturing temporal correlations between frames.
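A minimal sketch of the spatio-temporal recipe the paper evaluates, assuming a PyTorch setting: per-frame features from a pretrained CNN backbone are aggregated over time by a recurrent layer and regressed to a single quality score. The backbone, GRU, and head sizes are illustrative choices, not the specific UGC-VQA models assessed in the paper.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class SpatioTemporalVQA(nn.Module):
    """Per-frame CNN features aggregated over time by a GRU, regressed to one quality score."""

    def __init__(self, hidden_size=256):
        super().__init__()
        backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        self.spatial = nn.Sequential(*list(backbone.children())[:-1])  # drop the classifier
        self.temporal = nn.GRU(input_size=512, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, video):                       # video: (batch, frames, 3, H, W)
        b, t = video.shape[:2]
        feats = self.spatial(video.flatten(0, 1))   # (batch*frames, 512, 1, 1)
        feats = feats.flatten(1).view(b, t, -1)     # (batch, frames, 512)
        out, _ = self.temporal(feats)
        return self.head(out[:, -1]).squeeze(-1)    # one predicted quality score per video

# Example: scores = SpatioTemporalVQA()(torch.randn(2, 16, 3, 224, 224))
```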
{"title":"On the Importance of Spatio-Temporal Learning for Video Quality Assessment","authors":"Dario Fontanel, David Higham, Benoit Vallade","doi":"10.1109/WACVW58289.2023.00053","DOIUrl":"https://doi.org/10.1109/WACVW58289.2023.00053","url":null,"abstract":"Video quality assessment (VQA) has sparked a lot of interest in the computer vision community, as it plays a critical role in services that provide customers with high quality video content. Due to the lack of high quality reference videos and the difficulties in collecting subjective evaluations, assessing video quality is a challenging and still unsolved problem. Moreover, most of the public research efforts focus only on user-generated content (UGC), making it unclear if reliable solutions can be adopted for assessing the quality of production-related videos. The goal of this work is to assess the importance of spatial and temporal learning for production-related VQA. In particular, it assesses state-of-the-art UGC video quality assessment perspectives on LIVE-APV dataset, demonstrating the importance of learning contextual characteristics from each video frame, as well as capturing temporal correlations between them.","PeriodicalId":306545,"journal":{"name":"2023 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128283067","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Object-Ratio-Preserving Video Retargeting Framework based on Segmentation and Inpainting
Pub Date: 2023-01-01 | DOI: 10.1109/WACVW58289.2023.00055
Jun-Gyu Jin, Jaehyun Bae, Han-Gyul Baek, Sang-hyo Park
The recent growth of video-based content platforms has made videos from decades ago easily accessible. However, many of these older videos use a narrow, legacy screen aspect ratio. When such content is shown on a display with a wider aspect ratio, it is either stretched excessively in the horizontal direction or padded with black bars, both of which hinder comfortable viewing. In this paper, we propose a method for retargeting old-ratio video frames to a wider aspect ratio while preserving the original proportions of important objects in the content, using deep-learning-based semantic segmentation and inpainting techniques. Our experiments show that the proposed method produces visually natural retargeted frames.
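The core pipeline can be sketched as: remove the segmented objects, fill the holes, stretch only the background to the wider target width, and paste each object back at its original proportions. The OpenCV-based sketch below is a simplified, hedged illustration that substitutes classical inpainting for the deep inpainting network and uses a naive per-object paste-back.

```python
import cv2
import numpy as np

def retarget_frame(frame, mask, target_width):
    """Stretch the background to target_width, then paste segmented objects back
    at their original proportions. frame: HxWx3 uint8, mask: HxW uint8 (255 = object)."""
    h, w = mask.shape
    # Remove the objects and fill the holes, then stretch only the background.
    background = cv2.inpaint(frame, mask, 5, cv2.INPAINT_TELEA)
    background = cv2.resize(background, (target_width, h))
    out = background.copy()
    # Paste every connected object back, horizontally re-centred, at its original size.
    num, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
    for i in range(1, num):
        x, y, bw, bh, _ = stats[i]
        obj = frame[y:y + bh, x:x + bw]
        obj_mask = labels[y:y + bh, x:x + bw] == i
        cx = int((x + bw / 2) / w * target_width)          # proportional horizontal position
        x0 = max(0, min(target_width - bw, cx - bw // 2))
        region = out[y:y + bh, x0:x0 + bw]
        region[obj_mask] = obj[obj_mask]
    return out
```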
{"title":"Object-Ratio-Preserving Video Retargeting Framework based on Segmentation and Inpainting","authors":"Jun-Gyu Jin, Jaehyun Bae, Han-Gyul Baek, Sang-hyo Park","doi":"10.1109/WACVW58289.2023.00055","DOIUrl":"https://doi.org/10.1109/WACVW58289.2023.00055","url":null,"abstract":"The recent development of video-based content platforms led the easy access to videos decades ago. However, some past videos have a old screen ratio. If an image with this ratio is executed on a display with a wider screen ratio, the image is excessively stretched horizontally or creates a black box, which prevents efficient viewing of content. In this paper, we propose a method for retargeting the old ratio video frames to a wider ratio while maintaining the original ratio of important objects in content using deep learning-based semantic segmentation and inpainting techniques. Our research shows that proposed method can make a retargeted frames visually natural.","PeriodicalId":306545,"journal":{"name":"2023 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW)","volume":"218 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128969126","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Survey on the Deployability of Semantic Segmentation Networks for Fluvial Navigation
Pub Date: 2023-01-01 | DOI: 10.1109/WACVW58289.2023.00032
Reeve Lambert, Jianwen Li, Jalil Chavez-Galaviz, N. Mahmoudian
Neural network semantic image segmentation has developed into a powerful tool for autonomous navigational environmental comprehension in complex environments. While semantic segmentation networks have seen ample applications in the ground domain, implementations in the surface water domain, especially fluvial (rivers and streams) deployments, have lagged behind due to training data and literature sparsity. To tackle this problem, the publicly available River Obstacle Segmentation En-Route By USV Dataset (ROSEBUD) was recently published. The dataset provides unique rural fluvial training data for the binary water segmentation task to aid autonomous navigation in fluvial scenes. Despite new dataset sources, there is still a need for studies on networks that excel at understanding marine and fluvial scenes while operating efficiently on the computationally limited embedded systems that are common on autonomous marine platforms such as ASVs. To provide insight into state-of-the-art network capabilities on embedded systems, a survey of twelve networks encompassing eight different architectures was developed. Networks were trained and tested on a combination of three existing datasets, including the ROSEBUD dataset, and then implemented on an NVIDIA Jetson Nano to evaluate performance on real-world hardware. The survey's results lay out recommendations for which networks to use in autonomous applications in complex and fast-moving environments, weighing network performance against inference speed.
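A deployability survey of this kind hinges on measuring inference speed on the target device; a rough benchmarking sketch in PyTorch is shown below. The input size, warm-up and run counts, and the example torchvision model are assumptions for illustration, not the survey's exact protocol or networks.

```python
import time
import torch

@torch.no_grad()
def benchmark_fps(model, input_size=(1, 3, 512, 512), warmup=10, runs=50, device="cuda"):
    """Rough frames-per-second measurement for a segmentation network on one device."""
    model = model.eval().to(device)
    x = torch.randn(*input_size, device=device)
    for _ in range(warmup):              # warm-up iterations are excluded from timing
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return runs / (time.perf_counter() - start)

# Example with a torchvision model (any surveyed network could be dropped in instead):
# from torchvision.models.segmentation import fcn_resnet50
# print(benchmark_fps(fcn_resnet50(weights=None, num_classes=2), device="cpu"), "FPS")
```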
{"title":"A Survey on the Deployability of Semantic Segmentation Networks for Fluvial Navigation","authors":"Reeve Lambert, Jianwen Li, Jalil Chavez-Galaviz, N. Mahmoudian","doi":"10.1109/WACVW58289.2023.00032","DOIUrl":"https://doi.org/10.1109/WACVW58289.2023.00032","url":null,"abstract":"Neural network semantic image segmentation has developed into a powerful tool for autonomous navigational environmental comprehension in complex environments. While semantic segmentation networks have seen ample applications in the ground domain, implementations in the surface water domain, especially fluvial (rivers and streams) deployments, have lagged behind due to training data and literature sparsity issues. To tackle this problem the publicly available River Obstacle Segmentation En-Route By USV Dataset (ROSEBUD) was recently published. The dataset provides unique rural fluvial training data for the water binary segmentation task to aid in fluvial scene au-tonomous navigation. Despite new dataset sources, there is still a need for studies on networks that excel at both under-standing marine and fluvial scenes and efficiently operating on the computationally limited embedded systems that are common on autonomous marine platforms like ASVs. To provide insight into state-of-the-art network capabilities on embedded systems a survey of twelve networks encompassing 8 different architectures has been developed. Networks were trained and tested on a combination of three existing datasets, including the ROSEBUD dataset, and then implemented on an NVIDIA Jetson Nano to evaluate performance on real-world hardware. The survey's results layout recommendations for networks to use in autonomous applications in complex and fast-moving environments relative to network performance and inference speed.","PeriodicalId":306545,"journal":{"name":"2023 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW)","volume":"502 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121019714","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Self-Supervised Effective Resolution Estimation with Adversarial Augmentations
Pub Date: 2023-01-01 | DOI: 10.1109/WACVW58289.2023.00064
Manuel Kansy, Julian Balletshofer, Jacek Naruniec, Christopher Schroers, Graziana Mignone, M. Gross, Romann M. Weber
High-resolution, high-quality images of human faces are desired as training data and output for many modern applications, such as avatar generation, face super-resolution, and face swapping. The terms high-resolution and high-quality are often used interchangeably; however, the two concepts are not equivalent, and high resolution does not always imply high quality. To address this, we motivate and precisely define the concept of effective resolution in this paper. We thereby draw connections to signal and information theory and show why baselines based on frequency analysis or compression fail. Instead, we propose a novel self-supervised learning scheme to train a neural network for effective resolution estimation without human-labeled data. It leverages adversarial augmentations to bridge the domain gap between synthetic and real, authentic degradations, which allows us to train on domains, such as human faces, for which no or only few human labels exist. Finally, we demonstrate that our method outperforms state-of-the-art image quality assessment methods in estimating the sharpness of real and generated human faces, despite using only unlabeled data during training.
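One way to obtain self-supervised labels for effective resolution, in the spirit of the paper's synthetic degradations, is to downscale an image by a random factor and upscale it back, using the factor as the regression target. The sketch below covers only this label-generation step; the adversarial augmentations and the estimator network are omitted, and the interpolation choices are assumptions.

```python
import random
import cv2

def degrade_with_label(image, min_scale=0.1, max_scale=1.0):
    """Self-supervised pair: downscale by a random factor and upscale back.
    The pixel resolution is unchanged, but the effective resolution is roughly `scale`."""
    h, w = image.shape[:2]
    scale = random.uniform(min_scale, max_scale)
    small = cv2.resize(image, (max(1, int(w * scale)), max(1, int(h * scale))),
                       interpolation=cv2.INTER_AREA)
    restored = cv2.resize(small, (w, h), interpolation=cv2.INTER_LINEAR)
    return restored, scale   # training input and its regression target in (0, 1]
```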
{"title":"Self-Supervised Effective Resolution Estimation with Adversarial Augmentations","authors":"Manuel Kansy, Julian Balletshofer, Jacek Naruniec, Christopher Schroers, Graziana Mignone, M. Gross, Romann M. Weber","doi":"10.1109/WACVW58289.2023.00064","DOIUrl":"https://doi.org/10.1109/WACVW58289.2023.00064","url":null,"abstract":"High-resolution, high-quality images of human faces are desired as training data and output for many modern applications, such as avatar generation, face super-resolution, and face swapping. The terms high-resolution and high-quality are often used interchangeably; however, the two concepts are not equivalent, and high-resolution does not always imply high-quality. To address this, we motivate and precisely define the concept of effective resolution in this paper. We thereby draw connections to signal and information theory and show why baselines based on frequency analysis or compression fail. Instead, we propose a novel self-supervised learning scheme to train a neural network for effective resolution estimation without human-labeled data. It leverages adversarial augmentations to bridge the domain gap between synthetic and real, authentic degradations - thus allowing us to train on domains, such as hu-man faces, for which no or only few human labels exist. Finally, we demonstrate that our method outperforms state-of-the-art image quality assessment methods in estimating the sharpness of real and generated human faces, despite using only unlabeled data during training.","PeriodicalId":306545,"journal":{"name":"2023 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW)","volume":"92 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126759946","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Causal Structure Learning of Bias for Fair Affect Recognition
Pub Date: 2023-01-01 | DOI: 10.1109/WACVW58289.2023.00038
Jiaee Cheong, Sinan Kalkan, H. Gunes
The problem of bias in facial affect recognition tools can lead to severe consequences. It has been posited that causality can address the gaps induced by the associational nature of traditional machine learning, and one such gap is fairness. However, given the nascency of the field, there is still no clear mapping between tools in causality and applications in fair machine learning for the specific task of affect recognition. To address this gap, we provide the first causal structure formalisation of the different biases that can arise in affect recognition. We also conduct a proof of concept on utilising causal structure learning for post-hoc understanding and analysis of bias.
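To make the idea of a causal structure formalisation concrete, the toy graph below encodes one hypothetical set of bias pathways: a sensitive attribute S and the true affect A both influence the captured image X and the annotated label Y. The graph, variable names, and edges are illustrative assumptions, not the formalisation proposed in the paper.

```python
import networkx as nx

# Hypothetical bias pathways: true affect A and sensitive attribute S both
# influence the captured image X and the annotated label Y.
graph = nx.DiGraph()
graph.add_edges_from([("A", "X"), ("A", "Y"), ("S", "X"), ("S", "Y")])
assert nx.is_directed_acyclic_graph(graph)

# The directed path S -> Y encodes annotation (label) bias in this toy graph.
print(list(nx.all_simple_paths(graph, "S", "Y")))   # [['S', 'Y']]
```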
{"title":"Causal Structure Learning of Bias for Fair Affect Recognition","authors":"Jiaee Cheong, Sinan Kalkan, H. Gunes","doi":"10.1109/WACVW58289.2023.00038","DOIUrl":"https://doi.org/10.1109/WACVW58289.2023.00038","url":null,"abstract":"The problem of bias in facial affect recognition tools can lead to severe consequences and issues. It has been posited that causality is able to address the gaps induced by the associational nature of traditional machine learning, and one such gap is that of fairness. However, given the nascency of the field, there is still no clear mapping between tools in causality and applications in fair machine learning for the specific task of affect recognition. To address this gap, we provide the first causal structure formalisation of the different biases that can arise in affect recognition. We conducted a proof of concept on utilising causal structure learning for the post-hoc understanding and analysing bias.","PeriodicalId":306545,"journal":{"name":"2023 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW)","volume":"93 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126394056","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
$k$-NN Embedded Space Conditioning for Enhanced Few-Shot Object Detection
Pub Date: 2023-01-01 | DOI: 10.1109/WACVW58289.2023.00044
Stefan Matcovici, D. Voinea, A. Popa
Few-shot learning has attracted significant scientific interest in the past decade due to its applicability to visual tasks with a naturally long-tailed distribution, such as object detection. This paper introduces a novel and flexible few-shot object detection approach which can be adapted effortlessly to any candidate-based object detection framework. In particular, our proposed $\boldsymbol{kFEW}$ component leverages a kNN retrieval technique over the region-of-interest space to build both a class distribution and a weighted aggregated embedding conditioned on the recovered neighbours. The obtained kNN feature representation is used to drive the training process without any additional trainable parameters, as well as at inference time by steering the confidence and box coordinates predicted by the detection model. We perform extensive experiments and ablation studies on MS COCO and Pascal VOC, demonstrating its efficiency and state-of-the-art results (by margins of 2.3 mAP points on MS COCO and 2.5 mAP points on Pascal VOC) in the context of few-shot object detection. Additionally, we demonstrate its versatility and ease of integration by incorporating it into competitive few-shot object detection methods and obtaining superior results.
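The kNN conditioning idea can be sketched as follows: for one ROI embedding, retrieve the k most similar support embeddings, turn the similarities into weights, and form both a soft class distribution and a weighted aggregated embedding. This is a simplified illustration under assumed cosine similarity and softmax weighting, not the paper's exact kFEW module.

```python
import numpy as np

def knn_condition(roi_embedding, support_embeddings, support_labels,
                  k=5, num_classes=20, temperature=0.1):
    """For one ROI embedding, retrieve the k nearest support embeddings and build
    (1) a soft class distribution and (2) a similarity-weighted aggregated embedding."""
    norms = np.linalg.norm(support_embeddings, axis=1) * np.linalg.norm(roi_embedding) + 1e-8
    sims = support_embeddings @ roi_embedding / norms          # cosine similarities
    top = np.argsort(-sims)[:k]                                # indices of the k neighbours
    weights = np.exp(sims[top] / temperature)
    weights /= weights.sum()                                   # softmax over the neighbourhood
    class_dist = np.zeros(num_classes)
    for w, idx in zip(weights, top):
        class_dist[support_labels[idx]] += w
    aggregated = (weights[:, None] * support_embeddings[top]).sum(axis=0)
    return class_dist, aggregated
```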
{"title":"$k-text{NN}$ embeded space conditioning for enhanced few-shot object detection","authors":"Stefan Matcovici, D. Voinea, A. Popa","doi":"10.1109/WACVW58289.2023.00044","DOIUrl":"https://doi.org/10.1109/WACVW58289.2023.00044","url":null,"abstract":"Few-shot learning has attracted significant scientific interest in the past decade due to its applicability to visual tasks with a natural long-tailed distribution such as object detection. This paper introduces a novel and flexible few-shot object detection approach which can be adapted effortlessly to any candidate-based object detection frame-work. In particular, our proposed $boldsymbol{kFEW}$ component leverages a kNN retrieval technique over the regions of interest space to build both a class-distribution and a weighted aggregated embedding conditioned by the recovered neighbours. The obtained kNN feature representation is used to drive the training process without any additional trainable parameters as well as during inference time by steering the assumed confidence and the predicted box coordinates of the detection model. We perform extensive experiments and ablation studies on MS COCO and Pascal VOC proving its efficiency and state-of-the-art results (by a margin of 2.3 mAP points on MS COCO and by a margin of 2.5 mAP points on Pascal VOC) in the context of few-shot-object detection. Additionally, we demonstrate its versatility and ease-of-integration aspect by incorporating over competitive few-shot object detection methods and providing superior results.","PeriodicalId":306545,"journal":{"name":"2023 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW)","volume":"118 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127957521","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}