Pub Date: 2024-09-20  DOI: 10.1109/tpami.2024.3463799
CoVR-2: Automatic Data Construction for Composed Video Retrieval
Lucas Ventura, Antoine Yang, Cordelia Schmid, Gul Varol
Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers both text and image queries together, to search for relevant images in a database. Most CoIR approaches require manually annotated datasets, comprising image-text-image triplets, where the text describes a modification from the query image to the target image. However, manual curation of CoIR triplets is expensive and prevents scalability. In this work, we instead propose a scalable automatic dataset creation methodology that generates triplets given video-caption pairs, while also expanding the scope of the task to include composed video retrieval (CoVR). To this end, we mine paired videos with a similar caption from a large database, and leverage a large language model to generate the corresponding modification text. Applying this methodology to the extensive WebVid2M collection, we automatically construct our WebVid-CoVR dataset, resulting in 1.6 million triplets. Moreover, we introduce a new benchmark for CoVR with a manually annotated evaluation set, along with baseline results. We further validate that our methodology is equally applicable to image-caption pairs, by generating 3.3 million CoIR training triplets using the Conceptual Captions dataset. Our model builds on BLIP-2 pretraining, adapting it to composed video (or image) retrieval, and incorporates an additional caption retrieval loss to exploit extra supervision beyond the triplet, which is possible since captions are readily available for our training data by design. We provide extensive ablations to analyze the design choices on our new CoVR benchmark. Our experiments also demonstrate that training a CoVR model on our datasets effectively transfers to CoIR, leading to improved state-of-the-art performance in the zero-shot setup on the CIRR, FashionIQ, and CIRCO benchmarks. Our code, datasets, and models are publicly available at https://imagine.enpc.fr/~ventural/covr.
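The abstract describes the mining pipeline only at a high level: pair videos whose captions are nearly identical, then have a language model phrase the difference as a modification text. The sketch below is a minimal illustration of that idea, not the authors' released code; the single-word-difference heuristic and the llm_describe_modification callable are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations

def word_difference(a: str, b: str):
    """Return the (word_a, word_b) pair if the two captions differ in exactly one position."""
    wa, wb = a.lower().split(), b.lower().split()
    if len(wa) != len(wb):
        return None
    diffs = [(x, y) for x, y in zip(wa, wb) if x != y]
    return diffs[0] if len(diffs) == 1 else None

def mine_covr_triplets(caption_video_pairs, llm_describe_modification):
    """Mine (query video, modification text, target video) triplets from video-caption pairs.

    caption_video_pairs: iterable of (caption, video_id).
    llm_describe_modification: hypothetical callable that turns a caption pair and its
    differing words into a natural-language modification text, e.g. "change the dog to a cat".
    """
    # Group captions by word count so only same-length captions are compared.
    by_length = defaultdict(list)
    for caption, video_id in caption_video_pairs:
        by_length[len(caption.split())].append((caption, video_id))

    triplets = []
    for group in by_length.values():
        # Exhaustive pairing within a group; a real large-scale pipeline would index captions instead.
        for (cap_a, vid_a), (cap_b, vid_b) in combinations(group, 2):
            diff = word_difference(cap_a, cap_b)
            if diff is None:
                continue
            modification = llm_describe_modification(cap_a, cap_b, diff)
            triplets.append((vid_a, modification, vid_b))
    return triplets
```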
{"title":"CoVR-2: Automatic Data Construction for Composed Video Retrieval.","authors":"Lucas Ventura,Antoine Yang,Cordelia Schmid,Gul Varol","doi":"10.1109/tpami.2024.3463799","DOIUrl":"https://doi.org/10.1109/tpami.2024.3463799","url":null,"abstract":"Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers both text and image queries together, to search for relevant images in a database. Most CoIR approaches require manually annotated datasets, comprising image-text-image triplets, where the text describes a modification from the query image to the target image. However, manual curation of CoIR triplets is expensive and prevents scalability. In this work, we instead propose a scalable automatic dataset creation methodology that generates triplets given video-caption pairs, while also expanding the scope of the task to include composed video retrieval (CoVR). To this end, we mine paired videos with a similar caption from a large database, and leverage a large language model to generate the corresponding modification text. Applying this methodology to the extensive WebVid2M collection, we automatically construct our WebVid-CoVR dataset, resulting in 1.6 million triplets. Moreover, we introduce a new benchmark for CoVR with a manually annotated evaluation set, along with baseline results. We further validate that our methodology is equally applicable to image-caption pairs, by generating 3.3 million CoIR training triplets using the Conceptual Captions dataset. Our model builds on BLIP-2 pretraining, adapting it to composed video (or image) retrieval, and incorporates an additional caption retrieval loss to exploit extra supervision beyond the triplet, which is possible since captions are readily available for our training data by design. We provide extensive ablations to analyze the design choices on our new CoVR benchmark. Our experiments also demonstrate that training a CoVR model on our datasets effectively transfers to CoIR, leading to improved state-of-the-art performance in the zero-shot setup on the CIRR, FashionIQ, and CIRCO benchmarks. Our code, datasets, and models are publicly available at https://imagine.enpc.fr/ ventural/covr.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":null,"pages":null},"PeriodicalIF":23.6,"publicationDate":"2024-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142275195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-20  DOI: 10.1109/tpami.2024.3465649
Multi-sensor Learning Enables Information Transfer across Different Sensory Data and Augments Multi-modality Imaging
Lingting Zhu, Yizheng Chen, Lianli Liu, Lei Xing, Lequan Yu
Multi-modality imaging is widely used in clinical practice and biomedical research to gain a comprehensive understanding of an imaging subject. Currently, multi-modality imaging is accomplished by post hoc fusion of independently reconstructed images under the guidance of mutual information or spatially registered hardware, which limits the accuracy and utility of multi-modality imaging. Here, we investigate a data-driven multi-modality imaging (DMI) strategy for synergetic imaging of CT and MRI. We reveal two distinct types of features in multi-modality imaging, namely intra- and inter-modality features, and present a multi-sensor learning (MSL) framework to utilize the crossover inter-modality features for augmented multi-modality imaging. The MSL imaging approach breaks down the boundaries of traditional imaging modalities and allows for optimal hybridization of CT and MRI, which maximizes the use of sensory data. We showcase the effectiveness of our DMI strategy through synergetic CT-MRI brain imaging. The principle of DMI is quite general and holds enormous potential for various DMI applications across disciplines.
{"title":"Multi-sensor Learning Enables Information Transfer across Different Sensory Data and Augments Multi-modality Imaging.","authors":"Lingting Zhu,Yizheng Chen,Lianli Liu,Lei Xing,Lequan Yu","doi":"10.1109/tpami.2024.3465649","DOIUrl":"https://doi.org/10.1109/tpami.2024.3465649","url":null,"abstract":"Multi-modality imaging is widely used in clinical practice and biomedical research to gain a comprehensive understanding of an imaging subject. Currently, multi-modality imaging is accomplished by post hoc fusion of independently reconstructed images under the guidance of mutual information or spatially registered hardware, which limits the accuracy and utility of multi-modality imaging. Here, we investigate a data-driven multi-modality imaging (DMI) strategy for synergetic imaging of CT and MRI. We reveal two distinct types of features in multi-modality imaging, namely intra- and inter-modality features, and present a multi-sensor learning (MSL) framework to utilize the crossover inter-modality features for augmented multi-modality imaging. The MSL imaging approach breaks down the boundaries of traditional imaging modalities and allows for optimal hybridization of CT and MRI, which maximizes the use of sensory data. We showcase the effectiveness of our DMI strategy through synergetic CT-MRI brain imaging. The principle of DMI is quite general and holds enormous potential for various DMI applications across disciplines.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":null,"pages":null},"PeriodicalIF":23.6,"publicationDate":"2024-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142275199","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-20  DOI: 10.1109/tpami.2024.3460871
Continual Learning From a Stream of APIs
Enneng Yang, Zhenyi Wang, Li Shen, Nan Yin, Tongliang Liu, Guibing Guo, Xingwei Wang, Dacheng Tao
{"title":"Continual Learning From a Stream of APIs","authors":"Enneng Yang, Zhenyi Wang, Li Shen, Nan Yin, Tongliang Liu, Guibing Guo, Xingwei Wang, Dacheng Tao","doi":"10.1109/tpami.2024.3460871","DOIUrl":"https://doi.org/10.1109/tpami.2024.3460871","url":null,"abstract":"","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":null,"pages":null},"PeriodicalIF":23.6,"publicationDate":"2024-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142275362","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-20  DOI: 10.1109/tpami.2024.3465455
Event-enhanced Snapshot Mosaic Hyperspectral Frame Deblurring
Mengyue Geng, Lizhi Wang, Lin Zhu, Wei Zhang, Ruiqin Xiong, Yonghong Tian
Snapshot Mosaic Hyperspectral Cameras (SMHCs) are popular hyperspectral imaging devices for acquiring both color and motion details of scenes. However, the narrow-band spectral filters in SMHCs may negatively impact their motion perception ability, resulting in blurry SMHC frames. In this paper, we propose a hardware-software collaborative approach to address the blurring issue of SMHCs. Our approach involves integrating SMHCs with neuromorphic event cameras for efficient event-enhanced SMHC frame deblurring. To achieve spectral information recovery guided by event signals, we formulate a spectral-aware Event-based Double Integral (sEDI) model that links SMHC frames and events from a spectral perspective, providing principled model design insights. Then, we develop a Diffusion-guided Noise Awareness (DNA) training framework that utilizes diffusion models to learn noise-aware features and promote model robustness towards camera noise. Furthermore, we design an Event-enhanced Hyperspectral frame Deblurring Network (EvHDNet) based on sEDI, which is trained with DNA and features improved spatial-spectral learning and modality interaction for reliable SMHC frame deblurring. Experiments on both synthetic data and real data show that the proposed DNA + EvHDNet outperforms state-of-the-art methods on both spatial and spectral fidelity. The code and dataset will be made publicly available.
{"title":"Event-enhanced Snapshot Mosaic Hyperspectral Frame Deblurring.","authors":"Mengyue Geng,Lizhi Wang,Lin Zhu,Wei Zhang,Ruiqin Xiong,Yonghong Tian","doi":"10.1109/tpami.2024.3465455","DOIUrl":"https://doi.org/10.1109/tpami.2024.3465455","url":null,"abstract":"Snapshot Mosaic Hyperspectral Cameras (SMHCs) are popular hyperspectral imaging devices for acquiring both color and motion details of scenes. However, the narrow-band spectral filters in SMHCs may negatively impact their motion perception ability, resulting in blurry SMHC frames. In this paper, we propose a hardware-software collaborative approach to address the blurring issue of SMHCs. Our approach involves integrating SMHCs with neuromorphic event cameras for efficient event-enhanced SMHC frame deblurring. To achieve spectral information recovery guided by event signals, we formulate a spectral-aware Event-based Double Integral (sEDI) model that links SMHC frames and events from a spectral perspective, providing principled model design insights. Then, we develop a Diffusion-guided Noise Awareness (DNA) training framework that utilizes diffusion models to learn noise-aware features and promote model robustness towards camera noise. Furthermore, we design an Event-enhanced Hyperspectral frame Deblurring Network (EvHDNet) based on sEDI, which is trained with DNA and features improved spatial-spectral learning and modality interaction for reliable SMHC frame deblurring. Experiments on both synthetic data and real data show that the proposed DNA + EvHDNet outperforms stateof-the-art methods on both spatial and spectral fidelity. The code and dataset will be made publicly available.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":null,"pages":null},"PeriodicalIF":23.6,"publicationDate":"2024-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142275196","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-20  DOI: 10.1109/tpami.2024.3465535
RoBoSS: A Robust, Bounded, Sparse, and Smooth Loss Function for Supervised Learning
Mushir Akhtar, M. Tanveer, Mohd. Arshad
{"title":"RoBoSS: A Robust, Bounded, Sparse, and Smooth Loss Function for Supervised Learning","authors":"Mushir Akhtar, M. Tanveer, Mohd. Arshad","doi":"10.1109/tpami.2024.3465535","DOIUrl":"https://doi.org/10.1109/tpami.2024.3465535","url":null,"abstract":"","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":null,"pages":null},"PeriodicalIF":23.6,"publicationDate":"2024-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142275575","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-19  DOI: 10.1109/tpami.2024.3463966
Unveiling the Power of Self-Supervision for Multi-View Multi-Human Association and Tracking
Wei Feng, Feifan Wang, Ruize Han, Yiyang Gan, Zekun Qian, Junhui Hou, Song Wang
{"title":"Unveiling the Power of Self-Supervision for Multi-View Multi-Human Association and Tracking","authors":"Wei Feng, Feifan Wang, Ruize Han, Yiyang Gan, Zekun Qian, Junhui Hou, Song Wang","doi":"10.1109/tpami.2024.3463966","DOIUrl":"https://doi.org/10.1109/tpami.2024.3463966","url":null,"abstract":"","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":null,"pages":null},"PeriodicalIF":23.6,"publicationDate":"2024-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142275448","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-18  DOI: 10.1109/tpami.2024.3463753
T2TD: Text-3D Generation Model Based on Prior Knowledge Guidance
Weizhi Nie, Ruidong Chen, Weijie Wang, Bruno Lepri, Nicu Sebe
{"title":"T2TD: Text-3D Generation Model Based on Prior Knowledge Guidance","authors":"Weizhi Nie, Ruidong Chen, Weijie Wang, Bruno Lepri, Nicu Sebe","doi":"10.1109/tpami.2024.3463753","DOIUrl":"https://doi.org/10.1109/tpami.2024.3463753","url":null,"abstract":"","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":null,"pages":null},"PeriodicalIF":23.6,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142245659","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}