Pub Date: 2026-01-16 | DOI: 10.1007/s11263-025-02656-4
Dynamic Knowledge Transfer for Mitigating Spurious Correlations in Deep Learning
Xiaoling Zhou, Zhemg Lee, Wei Ye, Shikun Zhang
Pub Date: 2026-01-16 | DOI: 10.1007/s11263-025-02671-5
Exploiting Class-agnostic Visual Prior for Few-shot Keypoint Detection
Changsheng Lu, Hao Zhu, Piotr Koniusz
Deep learning-based keypoint detectors can localize specific object (or body) parts well, but still fall short of general keypoint detection. In contrast, few-shot keypoint detection (FSKD) is an underexplored yet more general task of localizing either base or novel keypoints, depending on the prompted support samples. In FSKD, building robust keypoint representations is the key to success. To this end, we propose an FSKD approach that models relations between keypoints. As keypoints are located on objects, we exploit a class-agnostic visual prior, i.e., an unsupervised saliency map or DINO attentiveness map, to obtain the region of focus within which we perform relation learning between object patches. The class-agnostic visual prior also helps suppress background noise that is largely irrelevant to keypoint locations. We then propose a novel Visual Prior guided Vision Transformer (VPViT). The visual prior maps are refined by a bespoke morphology learner to include relevant object context. The masked self-attention of VPViT takes the adapted prior map as a soft mask to constrain self-attention to foreground regions. As robust FSKD must also deal with the low number of support samples and with occlusions, we further investigate, based on VPViT, i) transductive FSKD to enhance keypoint representations with unlabeled data and ii) FSKD with masking and alignment (MAA) to improve robustness. We show that our model performs well on seven public datasets, and also significantly improves accuracy under transductive inference and occlusions. Source code is available at https://github.com/AlanLuSun/VPViT.
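To make the soft-masking idea concrete, here is a minimal PyTorch sketch, not the authors' VPViT implementation: a class-agnostic prior map, resampled to patch resolution, is added as a log-space bias to the attention logits so that attention concentrates on foreground patches. The single-head layout, tensor shapes, and the log-prior bias are illustrative assumptions.

```python
# Illustrative sketch only (not the paper's code): prior-guided soft masking
# of self-attention, assuming a per-patch foreground prior in [0, 1].
import torch
import torch.nn as nn

class PriorMaskedAttention(nn.Module):
    """Single-head self-attention whose logits are softly biased by a
    class-agnostic prior map, concentrating attention on foreground patches."""

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, tokens: torch.Tensor, prior: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) patch embeddings; prior: (B, N) foreground scores in [0, 1]
        q, k, v = self.q(tokens), self.k(tokens), self.v(tokens)
        logits = q @ k.transpose(-2, -1) / tokens.shape[-1] ** 0.5  # (B, N, N)
        # Soft mask: a log-prior bias down-weights keys the prior marks as
        # background without hard-zeroing them.
        logits = logits + torch.log(prior.unsqueeze(1) + 1e-6)
        return logits.softmax(dim=-1) @ v

# Toy usage with random tensors standing in for patch tokens and a saliency/DINO prior:
tokens = torch.randn(2, 196, 256)  # 14x14 patches, 256-dim embeddings
prior = torch.rand(2, 196)         # per-patch foreground score
out = PriorMaskedAttention(256)(tokens, prior)
print(out.shape)                   # torch.Size([2, 196, 256])
```

The log-space bias keeps the mask soft: low-prior patches are heavily down-weighted rather than hard-excluded, so useful context can still contribute.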
Pub Date: 2026-01-13 | DOI: 10.1007/s11263-025-02664-4
It’s Just Another Day: Unique Video Captioning by Discriminative Prompting
Toby Perrett, Tengda Han, Dima Damen, Andrew Zisserman
Long videos contain many repeating actions, events, and shots. These repetitions are frequently given identical captions, which makes it difficult to retrieve the exact desired clip using a text search. In this paper, we formulate the problem of unique captioning: given multiple clips with the same caption, we generate a new caption for each clip that uniquely identifies it. We propose Captioning by Discriminative Prompting (CDP), which predicts a property that can separate identically captioned clips and uses it to generate unique captions. We introduce two benchmarks for unique captioning, based on egocentric footage and timeloop movies, where repeating actions are common. We demonstrate that captions generated by CDP improve text-to-video R@1 by 15% on egocentric videos and by 10% on timeloop movies. Project page: https://tobyperrett.github.io/its-just-another-day
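As a rough illustration of the selection step described above, not the authors' CDP implementation, the sketch below assumes a hypothetical describe(clip, property) callable that answers a property question about a clip; the candidate-property list, the distinct-answer scoring, and the caption format are all illustrative assumptions.

```python
# Illustrative sketch only (not the paper's code): pick the property whose
# answers best separate identically captioned clips, then caption with it.
from typing import Callable, Sequence

def unique_captions(clips: Sequence[str],
                    shared_caption: str,
                    candidate_properties: Sequence[str],
                    describe: Callable[[str, str], str]) -> list[str]:
    """Choose the candidate property that tells the most clips apart,
    then use it to disambiguate each clip's caption."""
    best_prop, best_distinct = candidate_properties[0], 0
    for prop in candidate_properties:
        answers = [describe(clip, prop) for clip in clips]
        distinct = len(set(answers))  # how many clips this property separates
        if distinct > best_distinct:
            best_prop, best_distinct = prop, distinct
    answers = [describe(clip, best_prop) for clip in clips]
    # Keep the shared caption, qualified by each clip's property-specific answer.
    return [f"{shared_caption} ({best_prop}: {answer})" for answer in answers]

# Toy usage with a stub captioner: two clips share the caption "chopping vegetables".
stub = lambda clip, prop: {"clip_a": "an onion", "clip_b": "a carrot"}[clip]
print(unique_captions(["clip_a", "clip_b"], "chopping vegetables",
                      ["what is being chopped?"], stub))
```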
Pub Date: 2026-01-13 | DOI: 10.1007/s11263-025-02635-9
Automatic Solver Generator for Systems of Laurent Polynomial Equations
Evgeniy Martyushev, Snehal Bhayani, Tomas Pajdla
Pub Date: 2026-01-13 | DOI: 10.1007/s11263-025-02611-3
TCDiff++: An End-to-end Trajectory-Controllable Diffusion Model for Harmonious Music-Driven Group Choreography
Yuqin Dai, Wanlu Zhu, Ronghui Li, Xiu Li, Zhenyu Zhang, Jun Li, Jian Yang