Exocentric-to-Egocentric Adaptation for Temporal Action Segmentation with Unlabeled Synchronized Video Pairs
Pub Date : 2026-02-06 DOI: 10.1007/s11263-025-02675-1
Camillo Quattrocchi, Antonino Furnari, Daniele Di Mauro, Mario Valerio Giuffrida, Giovanni Maria Farinella
Homography Decomposition Revisited
Pub Date : 2026-02-05 DOI: 10.1007/s11263-025-02680-4
Yaqing Ding, Jian Yang, Zuzana Kukelova
A homography is a transformation that relates two images of the same planar surface taken from different viewpoints. Recovering motion parameters from a homography matrix is a classic problem in computer vision. Deriving a fast and stable solution to homography decomposition is important, since it forms a critical component of many vision systems, e.g., in Structure-from-Motion and visual localization. The current state-of-the-art solvers fall into two categories: numerical procedures based on singular value decomposition (SVD), and closed-form solutions. The SVD-based methods are stable but time-consuming, while the existing closed-form solution is faster but less stable. In this paper, we approach the homography decomposition problem from a different viewpoint. In contrast to existing methods, which focus on the properties of the homography matrix, we propose a new method that uses three random point correspondences to obtain the motion parameters in closed form. The proposed method is conceptually simple, easy to understand and implement, and has a clear geometrical interpretation. It can be seen as an alternative to the existing closed-form solution. We also discuss the configurations in which closed-form solutions may be unstable and present a framework for homography decomposition that accounts for both efficiency and stability.
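For concreteness, the quantity being decomposed is the plane-induced homography H ≃ R + (1/d) t nᵀ, where R and t are the relative camera rotation and translation, and n, d are the plane normal and distance. The minimal sketch below sets up this relation on an invented scene and recovers candidate motions with OpenCV's built-in (SVD-style) routine cv2.decomposeHomographyMat. This is only a baseline illustrating the problem; it is not the paper's three-point closed-form solver, and all scene values are illustrative.

```python
# Minimal sketch of the homography decomposition setup, assuming an
# invented planar scene; uses OpenCV's SVD-based decomposition as a
# baseline, NOT the paper's three-point closed-form method.
import numpy as np
import cv2

# Hypothetical ground-truth motion and plane.
R_gt, _ = cv2.Rodrigues(np.array([0.1, -0.2, 0.05]))  # rotation from axis-angle
t_gt = np.array([[0.3], [0.1], [0.5]])                # camera translation
n_gt = np.array([[0.0], [0.0], [1.0]])                # plane normal
d = 2.0                                               # plane distance

# Homography induced by the plane: H ~ R + (t n^T) / d.
H = R_gt + (t_gt @ n_gt.T) / d

# Decompose back into candidate (R, t, n) triplets. With identity
# intrinsics K, up to four candidates are returned; translations are
# recovered only up to the plane-distance scale, and cheirality /
# visibility constraints would prune to the physically valid solutions.
K = np.eye(3)
num, Rs, ts, ns = cv2.decomposeHomographyMat(H, K)
for R, t, n in zip(Rs, ts, ns):
    print("R:\n", R, "\nt:", t.ravel(), " n:", n.ravel())
```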
{"title":"Homography Decomposition Revisited","authors":"Yaqing Ding, Jian Yang, Zuzana Kukelova","doi":"10.1007/s11263-025-02680-4","DOIUrl":"https://doi.org/10.1007/s11263-025-02680-4","url":null,"abstract":"Homography refers to a specific type of transformation that relates two images of the same planar surface taken from different perspectives. Recovering motion parameters from a homography matrix is a classic problem in computer vision. It is important to derive a fast and stable solution to homography decomposition, since it forms a critical component of many vision systems, <jats:italic>e</jats:italic> . <jats:italic>g</jats:italic> ., in Structure-from-Motion and visual localization. The current state-of-the-art solvers can be categorized into two types of methods, the numerical procedures based on singular value decomposition (SVD), and the closed-form solution. The SVD-based methods are stable but time-consuming, while the existing closed-form solution is faster but less stable. In this paper, we discuss the homography decomposition problem from a different viewpoint. In contrast to the existing methods which focus on the properties of the homography matrix, we propose a new method that uses three random point correspondences to obtain the motion parameters in closed form. The proposed method is conceptually simple, easy to understand and implement, and has a good geometrical interpretation. This solution can be seen as an alternative to the existing closed-form solution. We also discuss the configurations where the closed-form solutions might be unstable and present a framework for homography decomposition taking into account both the efficiency and stability.","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"12 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2026-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146138691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exploring Scale Shift in Crowd Localization under the Context of Domain Generalization
Pub Date : 2026-02-02 DOI: 10.1007/s11263-025-02637-7
Juncheng Wang, Lei Shang, Ziqi Liu, Wang Lu, Xixu Hu, Zhe Hu, Jindong Wang, Shujun Wang
{"title":"Exploring Scale Shift in Crowd Localization under the Context of Domain Generalization","authors":"Juncheng Wang, Lei Shang, Ziqi Liu, Wang Lu, Xixu Hu, Zhe Hu, Jindong Wang, Shujun Wang","doi":"10.1007/s11263-025-02637-7","DOIUrl":"https://doi.org/10.1007/s11263-025-02637-7","url":null,"abstract":"","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"285 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2026-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146101329","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On Combining Animal Re-Identification Models to Address Small Datasets
Pub Date : 2026-01-30 DOI: 10.1007/s11263-025-02708-9
Aleksandr Algasov, Ekaterina Nepovinnykh, Fedor Zolotarev, Tuomas Eerola, Heikki Kälviäinen, Charles V. Stewart, Lasha Otarashvili, Jason A. Holmberg
Recent advances in the automatic re-identification of individual animals from images have opened up new possibilities for studying wildlife through camera traps and citizen science projects. Existing methods leverage distinct and permanent visual body markings, such as fur patterns or scars, and typically employ one of two approaches: local features or end-to-end learning. End-to-end learning-based methods outperform local feature-based methods given a sufficient amount of good-quality training data, but the difficulty of gathering such datasets for wild animals means that local feature-based methods remain a more practical approach for many species. In this study, we pursue two goals: (1) to better understand the impact of training-set size on animal re-identification, and (2) to explore ways of combining various methods so as to leverage the advantages of both approaches. We conduct comprehensive experiments across six different methods and six animal species with varying training-set sizes. Furthermore, we propose a simple yet effective combination strategy and show that properly selected method combinations outperform the individual methods by up to 30%, with both small and large training sets. The proposed combination strategy also offers a generalizable framework to improve accuracy across species and to address the challenges posed by small datasets, which are common in ecological research. This work lays the foundation for more robust and accessible tools to support wildlife conservation, population monitoring, and behavioral studies.
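The abstract does not spell out the combination strategy itself. One plausible reading is a simple late fusion of query-gallery similarity scores from the individual re-ID methods, sketched below under that assumption; the function name, normalization choice, and toy scores are all hypothetical, not the paper's actual procedure.

```python
# Hedged sketch of one plausible late-fusion combination for re-ID:
# per-query min-max normalize each method's similarity matrix, average,
# then rank gallery identities by the fused score.
import numpy as np

def fuse_and_rank(sim_matrices):
    """sim_matrices: list of (num_queries, num_gallery) score arrays,
    one per re-ID method. Returns ranked gallery indices per query."""
    fused = np.zeros_like(sim_matrices[0], dtype=float)
    for s in sim_matrices:
        # Normalize per query so methods with different score scales
        # (e.g. cosine similarity vs. local-feature match counts)
        # contribute equally to the fused ranking.
        lo = s.min(axis=1, keepdims=True)
        hi = s.max(axis=1, keepdims=True)
        fused += (s - lo) / (hi - lo + 1e-9)
    fused /= len(sim_matrices)
    # Higher fused score = better match; sort gallery in descending order.
    return np.argsort(-fused, axis=1)

# Toy example: two methods, 2 queries, 4 gallery individuals.
local_feat = np.array([[0.2, 0.9, 0.1, 0.4], [0.7, 0.3, 0.8, 0.2]])
end_to_end = np.array([[0.1, 0.8, 0.3, 0.2], [0.6, 0.1, 0.9, 0.4]])
print(fuse_and_rank([local_feat, end_to_end]))
```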
{"title":"On Combining Animal Re-Identification Models to Address Small Datasets","authors":"Aleksandr Algasov, Ekaterina Nepovinnykh, Fedor Zolotarev, Tuomas Eerola, Heikki Kälviäinen, Charles V. Stewart, Lasha Otarashvili, Jason A. Holmberg","doi":"10.1007/s11263-025-02708-9","DOIUrl":"https://doi.org/10.1007/s11263-025-02708-9","url":null,"abstract":"Recent advancements in the automatic re-identification of animal individuals from images have opened up new possibilities for studying wildlife through camera traps and citizen science projects. Existing methods leverage distinct and permanent visual body markings, such as fur patterns or scars, and typically employ one of two approaches: local features or end-to-end learning. The end-to-end learning-based methods outperform local feature-based methods given a sufficient amount of good-quality training data, but the challenge of gathering such datasets for wildlife animals means that local feature-based methods remain a more practical approach for many species. In this study, we aim to achieve two goals: (1) to obtain a better understanding of the impact of training-set size on animal re-identification, and (2) to explore ways to combine various methods to leverage the advantages of their approaches for re-identification. In the work, we conduct comprehensive experiments across six different methods and six animal species with various training set sizes. Furthermore, we propose a simple yet effective combination strategy and show that a properly selected method combinations outperform the individual methods with both small and large training sets up to 30%. Additionally, the proposed combination strategy offers a generalizable framework to improve accuracy across species and address the challenges posed by small datasets, which are common in ecological research. This work lays the foundation for more robust and accessible tools to support wildlife conservation, population monitoring, and behavioral studies.","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"74 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2026-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146095833","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Practical Video Object Detection via Feature Selection and Aggregation
Pub Date : 2026-01-30 DOI: 10.1007/s11263-025-02700-3
Yuheng Shi, Tong Zhang, Xiaojie Guo
{"title":"Practical Video Object Detection via Feature Selection and Aggregation","authors":"Yuheng Shi, Tong Zhang, Xiaojie Guo","doi":"10.1007/s11263-025-02700-3","DOIUrl":"https://doi.org/10.1007/s11263-025-02700-3","url":null,"abstract":"","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"288 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2026-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146095832","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bilateral Transformation of Biased Pseudo-Labels under Distribution Inconsistency","authors":"Ruibing Hou, Hong Chang, MinYang Hu, BingPeng Ma, Shiguang Shan, Xilin Chen","doi":"10.1007/s11263-025-02701-2","DOIUrl":"https://doi.org/10.1007/s11263-025-02701-2","url":null,"abstract":"","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"8 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2026-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146095835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Designing Extremely Memory-Efficient CNNs for On-device Vision and Audio Tasks
Pub Date : 2026-01-29 DOI: 10.1007/s11263-025-02688-w
Yoel Park, Jaewook Lee, Seulki Lee
In this paper, we introduce a memory-efficient CNN (convolutional neural network) that enables resource-constrained low-end embedded and IoT devices to perform on-device vision and audio tasks, such as image classification, object detection, and audio classification, using extremely little memory, i.e., only 63 KB for ImageNet classification. Based on the bottleneck block of MobileNet, we propose three design principles that significantly curtail the peak memory usage of a CNN so that it can fit within the limited KB-scale memory of a low-end device. First, 'input segmentation' divides an input image into a set of patches, including a central patch that overlaps the others, reducing the size (and memory requirement) of a large input image. Second, 'patch tunneling' builds an independent tunnel-like path of bottleneck blocks per patch, penetrating the entire model from the input patch to the last layer of the network and keeping memory usage lightweight throughout. Lastly, 'bottleneck reordering' rearranges the execution order of the convolution operations inside the bottleneck block so that memory usage remains constant regardless of the number of convolution output channels. We also present 'peak memory aware quantization', enabling the desired peak memory reduction in actual deployment of the quantized network. Experimental results show that the proposed network classifies ImageNet with extremely low memory usage (i.e., 63 KB) while achieving competitive top-1 accuracy (i.e., 61.58%). To the best of our knowledge, the memory usage of the proposed network is far smaller than that of state-of-the-art memory-efficient networks, i.e., up to 89x and 3.1x smaller than MobileNet (i.e., 5.6 MB) and MCUNet (i.e., 196 KB), respectively.
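To make the first two principles concrete, the sketch below splits a 224x224 input into four corner patches plus a central overlapping patch and routes each through its own small path, so peak activation memory scales with one patch rather than the full image. The layer sizes, patch layout, and merge step are assumptions for illustration, not the paper's exact architecture, and a real low-memory deployment would run the tunnels sequentially rather than keeping all patch activations alive at once.

```python
# Illustrative sketch of 'input segmentation' + 'patch tunneling' only,
# under assumed layer sizes; not the paper's exact design.
import torch
import torch.nn as nn

class PatchTunnelNet(nn.Module):
    def __init__(self, num_classes=10, patch=112):
        super().__init__()
        self.patch = patch
        # One small independent tunnel per patch (5 patches total).
        self.tunnels = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            ) for _ in range(5)
        ])
        self.head = nn.Linear(5 * 32, num_classes)

    def forward(self, x):                      # x: (B, 3, 224, 224)
        p = self.patch
        h, w = x.shape[-2:]
        patches = [x[..., :p, :p], x[..., :p, -p:],   # four corners
                   x[..., -p:, :p], x[..., -p:, -p:],
                   x[..., h//2 - p//2:h//2 + p//2,    # central overlapping patch
                         w//2 - p//2:w//2 + p//2]]
        # Each patch runs through its own tunnel end-to-end; sequential
        # execution would free each patch's buffers before the next.
        feats = [t(pt) for t, pt in zip(self.tunnels, patches)]
        return self.head(torch.cat(feats, dim=1))

net = PatchTunnelNet()
print(net(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 10])
```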
{"title":"Designing Extremely Memory-Efficient CNNs for On-device Vision and Audio Tasks","authors":"Yoel Park, Jaewook Lee, Seulki Lee","doi":"10.1007/s11263-025-02688-w","DOIUrl":"https://doi.org/10.1007/s11263-025-02688-w","url":null,"abstract":"In this paper, we introduce a memory-efficient CNN (convolutional neural network), which enables resource-constrained low-end embedded and IoT devices to perform on-device vision and audio tasks, such as image classification, object detection, and audio classification, using extremely low memory, <jats:italic>i.e</jats:italic> ., only 63 KB on ImageNet classification. Based on the bottleneck block of MobileNet, we propose three design principles that significantly curtail the peak memory usage of a CNN so that it can fit the limited KB memory of the low-end device. First, ‘input segmentation’ divides an input image into a set of patches, including the central patch overlapped with the others, reducing the size (and memory requirement) of a large input image. Second, ‘patch tunneling’ builds independent tunnel-like paths consisting of multiple bottleneck blocks per patch, penetrating through the entire model from an input patch to the last layer of the network, maintaining lightweight memory usage throughout the whole network. Lastly, ‘bottleneck reordering’ rearranges the execution order of convolution operations inside the bottleneck block such that the memory usage remains constant regardless of the size of the convolution output channels. We also present ‘peak memory aware quantization’, enabling desired peak memory reduction in actual deployment of quantized network. The experiment result shows that the proposed network classifies ImageNet with extremely low memory ( <jats:italic>i.e</jats:italic> ., 63 KB) while achieving competitive top-1 accuracy ( <jats:italic>i.e</jats:italic> ., 61.58%). To the best of our knowledge, the memory usage of the proposed network is far smaller than state-of-the-art memory-efficient networks, <jats:italic>i.e</jats:italic> ., up to 89x and 3.1x smaller than MobileNet ( <jats:italic>i.e</jats:italic> ., 5.6 MB) and MCUNet ( <jats:italic>i.e</jats:italic> ., 196 KB), respectively.","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"62 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2026-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146095836","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}