ROOT: Region-word Alignment with Partial Optimal Transport for Open-vocabulary Object Detection
Pub Date : 2025-11-18 DOI: 10.1109/tip.2025.3627395
Jinhong Deng, Yinjie Lei, Wen Li, Lixin Duan
Open-vocabulary object detection (OVD) aims to detect novel object concepts by mining region-word correspondences from image-text pairs, yet current methods often produce false correspondences. While some strategies (e.g., one-to-one matching) have been proposed to mitigate this issue, they often sacrifice many valuable region-word pairs during matching. To overcome these challenges, we propose a comprehensive alignment framework, Region-word Alignment with Partial Optimal Transport (ROOT), which reframes region-word matching as a partial distribution alignment problem. Unlike traditional optimal transport, which moves the full mass of the distribution, partial optimal transport enables selective matching, making it more robust to noise in region-word alignment. Specifically, ROOT first employs partial optimal transport to obtain an optimal transport plan between region and word features. This plan is then used to compute a matching reliability score for each region-word pair, which reweights the contrastive alignment loss to improve accuracy. By enabling more flexible and reliable region-word matches, ROOT significantly reduces misalignment errors while preserving valuable region-word correspondences. Extensive experiments on the standard OV-COCO and OV-LVIS benchmarks show that ROOT outperforms previous state-of-the-art methods, demonstrating the effectiveness of our approach.
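To make the mechanism concrete, here is a minimal PyTorch sketch of the idea as the abstract describes it: solve an entropic partial optimal transport between region and word features (via the standard dummy-node construction), then reuse the transport plan as per-pair reliability weights in a contrastive loss. Every name and hyperparameter below (`partial_ot_plan`, `mass`, `eps`, `n_iters`) is an illustrative assumption, not the authors' implementation.

```python
# Hedged sketch: entropic partial OT via a dummy row/column that absorbs
# the (1 - mass) fraction of unmatched probability mass, followed by a
# reliability-weighted contrastive loss. Assumed, not the paper's code.
import torch
import torch.nn.functional as F

def partial_ot_plan(regions, words, mass=0.7, eps=0.05, n_iters=50):
    n, m = regions.size(0), words.size(0)
    # Cosine cost in [0, 2]; the dummy node has zero cost, so leftover mass flows there.
    cost = 1.0 - F.normalize(regions, dim=-1) @ F.normalize(words, dim=-1).T
    cost = F.pad(cost, (0, 1, 0, 1), value=0.0)       # append dummy row + column
    # Marginals: uniform over real nodes, residual mass on the dummies.
    a = torch.full((n + 1,), mass / n); a[-1] = 1.0 - mass
    b = torch.full((m + 1,), mass / m); b[-1] = 1.0 - mass
    K = torch.exp(-cost / eps)                        # Gibbs kernel
    u = torch.ones_like(a)
    for _ in range(n_iters):                          # Sinkhorn iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    plan = u[:, None] * K * v[None, :]
    return plan[:n, :m]                               # drop dummy entries

def reweighted_contrastive_loss(regions, words, tau=0.07):
    plan = partial_ot_plan(regions, words)
    # Row-normalize the plan into per-region reliability scores.
    w = plan / plan.sum(dim=1, keepdim=True).clamp_min(1e-8)
    logits = (F.normalize(regions, dim=-1) @ F.normalize(words, dim=-1).T) / tau
    return -(w * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

# toy usage: 8 region features, 5 word features
torch.manual_seed(0)
loss = reweighted_contrastive_loss(torch.randn(8, 256), torch.randn(5, 256))
```

The dummy row and column absorb the unmatched 1 - mass fraction of probability, which is what lets noisy regions or words opt out of matching rather than being forced into a correspondence, as in full optimal transport.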
{"title":"ROOT: Region-word Alignment with Partial Optimal Transport for Open-vocabulary Object Detection.","authors":"Jinhong Deng,Yinjie Lei,Wen Li,Lixin Duan","doi":"10.1109/tip.2025.3627395","DOIUrl":"https://doi.org/10.1109/tip.2025.3627395","url":null,"abstract":"Open-vocabulary object detection (OVD) aims to detect novel object concepts by mining region-word correspondences from image-text pairs, yet current methods often produce false correspondences. While some strategies (e.g., one-to-one matching) were proposed to mitigate this issue, they often sacrifice numerous valuable region-word pairs during the matching process. To overcome these challenges, we propose a novel comprehensive alignment method, named Region-word Alignment with Partial Optimal Transport (ROOT) framework, which reframes the region-word matching task as a problem of partial distribution alignment. Unlike traditional optimal transport, which shifts the full mass of the distribution, partial optimal transport enables selective matching, making it more robust to noise in region and word alignment. Specifically, ROOT first employs partial optimal transport to obtain an optimal transport plan for region and word feature alignment. This transport plan is then used to compute a matching reliability score for each region-word pair, which reweights the contrastive alignment loss to enhance accuracy. By enabling more flexible and reliable region-text matches, ROOT significantly reduces misalignment errors while preserving valuable region-word correspondences. Extensive experiments on standard benchmarks OV-COCO and OV-LVIS show that our ROOT outperforms the previous state-of-the-art works, demonstrating the effectiveness of our approach.","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"130 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2025-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145545045","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards Generalizable Prompt Learning via Multi-regularization Guided Knowledge Distillation
Pub Date : 2025-11-18 DOI: 10.1109/tip.2025.3632223
Xi Yang, Xinyue Zhong, Dechen Kong, Nannan Wang
Prompt learning has made significant progress in vision-language models (VLMs), enabling pre-trained models like CLIP to perform cross-domain tasks with few-shot or even zero-shot learning. However, existing methods tend to overfit the training data after fine-tuning on the target domain, degrading generalization and limiting performance on unseen categories. To address these challenges, we propose multi-regularization guided knowledge distillation for generalizable prompt learning. This approach enhances the model's adaptability and generalization through regularization at different stages, while mitigating the performance degradation caused by target-domain training. Specifically, within the image encoder of CLIP, we introduce Residual Regularization, which binds additional residual connections to certain transformer blocks. This design provides greater flexibility, allowing the model to adjust to new data distributions when adapting to the target domain. Furthermore, during training, we impose Self-distillation Regularization so that the model preserves its prior generalization knowledge while adapting to the target domain: we regularize the intermediate outputs of the transformer blocks to prevent the model from excessively favoring target-domain data. Additionally, we employ an unsupervised knowledge distillation strategy, Direction Distillation Regularization, to enforce multi-level alignment between the teacher and student models. This ensures that both models maintain consistent visual feature orientations under the same textual features, enhancing overall stability and cross-domain adaptability. Experimental results demonstrate that our method achieves more stable classification performance in both cross-domain few-shot classification and domain adaptation settings.
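For intuition, a hedged PyTorch sketch of two of these regularizers follows, reconstructed only from the abstract: a direction-distillation term that keeps the student's image features oriented toward the shared text embeddings the same way the frozen teacher's are, and a self-distillation term on intermediate transformer-block outputs. The loss forms and all names (`direction_distillation_loss`, `self_distillation_loss`) are my assumptions, not the paper's code.

```python
# Speculative sketch of the distillation regularizers described in the abstract.
import torch
import torch.nn.functional as F

def direction_distillation_loss(student_vis, teacher_vis, text_feats):
    """Match the direction each image feature points in text-embedding space.

    student_vis, teacher_vis: (B, D) image features from student / frozen teacher.
    text_feats: (C, D) class text embeddings shared by both models.
    """
    s = F.normalize(student_vis, dim=-1) @ F.normalize(text_feats, dim=-1).T  # (B, C)
    with torch.no_grad():
        t = F.normalize(teacher_vis, dim=-1) @ F.normalize(text_feats, dim=-1).T
    # 1 - cosine between the two similarity profiles: zero when the student
    # orients itself toward the classes exactly as the teacher does.
    return (1.0 - F.cosine_similarity(s, t, dim=-1)).mean()

def self_distillation_loss(student_hiddens, teacher_hiddens):
    """L2 penalty on intermediate transformer-block outputs, keeping the
    fine-tuned encoder close to its pre-trained (zero-shot) behaviour."""
    return sum(F.mse_loss(hs, ht.detach())
               for hs, ht in zip(student_hiddens, teacher_hiddens)) / len(student_hiddens)

# toy usage: 4 images, 10 classes, 512-dim CLIP-like features
loss = direction_distillation_loss(torch.randn(4, 512), torch.randn(4, 512),
                                   torch.randn(10, 512))
```

Freezing the teacher's similarity profile (the `no_grad` branch) is what makes this a one-way constraint: the student may move to fit the target domain, but only along directions the pre-trained model already agrees with.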
{"title":"Towards Generalizable Prompt Learning via Multi-regularization Guided Knowledge Distillation.","authors":"Xi Yang,Xinyue Zhong,Dechen Kong,Nannan Wang","doi":"10.1109/tip.2025.3632223","DOIUrl":"https://doi.org/10.1109/tip.2025.3632223","url":null,"abstract":"Prompt learning has made significant progress in vision-language models (VLMs), enabling pre-trained models like CLIP to perform cross-domain tasks with few-shot or even zero-shot learning. However, existing methods tend to overfit the training data after fine-tuning on the target domain, leading to a decline in generalization ability and limiting their performance on unseen categories.To address these challenges, we propose a multi-regularization guided knowledge distillation towards generalizable prompt learning. This approach enhances the model's adaptability and generalization through different stages of regularization while mitigating performance degradation caused by target domain training. Specifically, within the image encoder of CLIP, we introduce Residual Regularization, which binds additional residual connections to certain transformer blocks. This design provides greater flexibility, allowing the model to adjust to new data distributions when adapting to the target domain.Furthermore, during training, we impose Self-distillation Regularization to ensure that while adapting to the target domain, the model preserves its prior generalization knowledge. Specifically, we regularize the intermediate layer outputs of Transformer Blocks to prevent the model from excessively favoring target domain data. Additionally, we employ an unsupervised knowledge distillation strategy to enforce multi-level alignment between the teacher and student models by Direction Distillation Regularization. This ensures that both models maintain consistent visual feature orientations under the same textual features, thereby enhancing overall model stability and cross-domain adaptability.Experimental results demonstrate that our method achieves more stable classification performance in both cross-domain few-shot classification and domain adaptation settings.","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"1 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2025-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145545044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Informative Sample Selection Model for Skeleton-based Action Recognition with Limited Training Samples
Pub Date : 2025-11-07 DOI: 10.1109/tip.2025.3627418
Zhigang Tu, Zhengbo Zhang, Jia Gong, Junsong Yuan, Bo Du
{"title":"Informative Sample Selection Model for Skeleton-based Action Recognition with Limited Training Samples","authors":"Zhigang Tu, Zhengbo Zhang, Jia Gong, Junsong Yuan, Bo Du","doi":"10.1109/tip.2025.3627418","DOIUrl":"https://doi.org/10.1109/tip.2025.3627418","url":null,"abstract":"","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"10 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2025-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145461388","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}