Diffusion-Based Data Augmentation for Image Recognition: A Systematic Analysis and Evaluation
Zekun Li, Yinghuan Shi, Yang Gao, Dong Xu
International Journal of Computer Vision. Pub Date: 2026-02-21. DOI: 10.1007/s11263-026-02754-x
An Effective-Efficient Approach for Dense Multi-Label Action Detection
Faegheh Sardari, Armin Mustafa, Philip J. B. Jackson, Adrian Hilton
International Journal of Computer Vision. Pub Date: 2026-02-21. DOI: 10.1007/s11263-026-02738-x
Unlike the sparse-label action detection task, where a single action occurs at each timestamp of a video, in a dense multi-label scenario actions can overlap temporally. Addressing this challenging task requires simultaneously learning (i) co-occurring action relationships and (ii) temporal dependencies. Current methods model co-occurring action relationships by explicitly embedding class relations into the transformer network architecture. However, these approaches are not computationally efficient, as the network must compute relations between all possible pairs of action classes. In this paper, we overcome this by introducing a framework trained through a novel learning paradigm that allows the network to benefit from explicitly modelling temporal co-occurrence dependencies between actions during training, without incurring their computational overhead during inference. Furthermore, to model temporal information, recent approaches extract multi-scale temporal features through hierarchical transformer-based networks. However, the self-attention mechanism in transformers inherently loses temporal positional information, and we argue that combining it with the multiple sub-sampling steps of hierarchical designs leads to further loss of positional information. Preserving this information is essential for accurate action detection. We address this issue by proposing a novel transformer network that (a) employs a non-hierarchical structure when modelling different ranges of temporal dependencies and (b) embeds relative positional encoding in its transformer layers. We evaluate our approach on two challenging dense multi-label benchmark datasets and show that it improves the previous state of the art by 1.1% and 0.6% per-frame mAP on the Charades and MultiTHUMOS datasets, respectively, achieving new state-of-the-art per-frame mAP results of 26.5% and 44.6%. We also performed extensive ablation studies to examine the impact of the different components of our approach. Our code will be released upon paper publication.
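The relative positional encoding ingredient in (b) can be illustrated with a minimal single-head self-attention sketch. This is not the authors' implementation; the function name, shapes, and the simple additive-bias form of the encoding are assumptions for illustration only. The idea it demonstrates is that attention scores receive a learnable term indexed by the relative offset j - i between frames, so temporal distance information survives inside the attention computation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def relative_attention(x, w_q, w_k, w_v, rel_bias):
    """Single-head self-attention with an additive relative positional bias.

    x:        (T, d) sequence of per-frame features (hypothetical shapes)
    w_q/k/v:  (d, d) query/key/value projection matrices
    rel_bias: (2T-1,) learnable bias, one entry per relative offset j - i
    """
    T, d = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(d)                            # (T, T) content term
    offsets = np.arange(T)[None, :] - np.arange(T)[:, None]  # j - i in [-(T-1), T-1]
    scores = scores + rel_bias[offsets + T - 1]              # shift offsets to valid indices
    return softmax(scores, axis=-1) @ v                      # (T, d) attended features

rng = np.random.default_rng(0)
T, d = 8, 16
x = rng.standard_normal((T, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
rel_bias = rng.standard_normal(2 * T - 1) * 0.1
out = relative_attention(x, w_q, w_k, w_v, rel_bias)
print(out.shape)  # (8, 16)
```

Because the bias depends only on the offset j - i rather than absolute positions, the same table applies at every temporal range a layer attends over, which is one reason relative encodings pair naturally with the non-hierarchical multi-range design the abstract describes.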
Code: https://github.com/faeghehsardari/E-E-IJCV
Not All Attention is Needed: Parameter and Computation Efficient Tuning for Multi-modal Large Language Models via Effective Attention Skipping
Qiong Wu, Yiyi Zhou, Weihao Ye, Xiaoshuai Sun, Rongrong Ji
International Journal of Computer Vision. Pub Date: 2026-02-21. DOI: 10.1007/s11263-025-02702-1
A Polynomial Formula for the Perspective Four Points Problem
David Lehavi, Brian Osserman
International Journal of Computer Vision. Pub Date: 2026-02-21. DOI: 10.1007/s11263-025-02660-8