Object localization is a fundamental computer-vision task that traditionally requires labeled datasets for accurate results. Recent progress in self-supervised learning has enabled unsupervised object localization, reducing reliance on manual annotations. Unlike supervised encoders, which depend on annotated training data, self-supervised encoders learn semantic representations directly from large collections of unlabeled images. This makes them a natural foundation for unsupervised object localization: they capture object-relevant features without the need for costly manual labels. These encoders produce semantically coherent patch embeddings, and grouping these embeddings reveals sets of patches that correspond to objects in an image. Such patch sets can be converted into object masks or bounding boxes, enabling tasks such as single-object discovery, multi-object detection, and instance segmentation. By applying offline mask clustering or leveraging pre-trained vision-language models, unsupervised localization methods can assign semantic labels to discovered objects, transforming initially class-agnostic objects (objects without class labels) into class-aware ones (objects with class labels) and aligning these tasks with their supervised counterparts. This paper provides a structured review of unsupervised object localization methods in both class-agnostic and class-aware settings, whereas previous surveys have focused only on class-agnostic localization. We discuss state-of-the-art object discovery strategies based on self-supervised features and provide a detailed comparison of experimental results across a wide range of tasks, datasets, and evaluation metrics.
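The patch-grouping step described above can be illustrated with a minimal toy sketch. The code below is not any particular method from the literature: it uses synthetic stand-in embeddings (real methods would use patch features from a self-supervised ViT such as DINO) and a plain 2-means grouping, then derives a mask and a bounding box from the resulting patch set.

```python
import numpy as np

# Synthetic stand-in for encoder output: a 4x4 grid of 8-dim patch embeddings.
# Central patches are shifted to mimic a semantically coherent "object" cluster.
rng = np.random.default_rng(0)
H = W = 4
D = 8
emb = rng.normal(0.0, 0.1, size=(H, W, D))
emb[1:3, 1:3] += 1.0                     # hypothetical "object" patches

X = emb.reshape(-1, D)

def two_means(X, iters=20):
    """Plain 2-means: assign each patch to its nearest centroid, then update."""
    # Initialize centroids from the patches with smallest / largest feature sum.
    c = np.stack([X[X.sum(1).argmin()], X[X.sum(1).argmax()]])
    for _ in range(iters):
        d = ((X[:, None, :] - c[None]) ** 2).sum(-1)   # (N, 2) squared distances
        labels = d.argmin(1)
        for k in range(2):
            if (labels == k).any():
                c[k] = X[labels == k].mean(0)
    return labels

# Group patches; cluster 1 is seeded from the high-activation patch, so it
# plays the role of the discovered object here.
mask = (two_means(X) == 1).reshape(H, W)

# Convert the patch set into a bounding box (x_min, y_min, x_max, y_max).
ys, xs = np.where(mask)
box = (xs.min(), ys.min(), xs.max(), ys.max())
```

On this toy input the mask recovers the shifted central patches and the box tightly encloses them; real pipelines replace the synthetic features with encoder patch embeddings and often use spectral or graph-based grouping instead of k-means.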
