Image Shooting Parameter-Guided Cascade Image Retouching Network: Think Like an Artist
Hailong Ma; Sibo Feng; Xi Xiao; Chenyu Dong; Xingyue Cheng
Pub Date: 2024-12-23 | DOI: 10.1109/TMM.2024.3521779 | IEEE Transactions on Multimedia, vol. 27, pp. 1566-1573
Photo retouching adjusts the hue, luminance, contrast, and saturation of an image to make it more aesthetically pleasing to human viewers. Drawing on studies of the imaging process and of artists' retouching workflows, we propose three improvements to existing automatic retouching methods. First, previous retouching methods ignore the imaging conditions recorded in EXIF metadata. We therefore design a simple module, the EXIF Condition Module (ECM), that injects these imaging conditions into the network; it improves several existing auto-retouching methods at only a small parameter cost. Second, artists' editing operations have also been overlooked. By investigating how artists retouch, we propose a two-stage network that first brightens the image and then enriches it in the chrominance plane, mimicking the artists' workflow. Finally, we find a color imbalance in existing retouching datasets and design a hue palette loss to resolve it and make images more vibrant. Experimental results show that our method is effective on the benchmark MIT-Adobe FiveK and PPR10K datasets and achieves state-of-the-art performance in both quantitative and qualitative evaluation.
Classification Committee for Active Deep Object Detection
Lei Zhao; Bo Li; Jixiang Jiang; Xingxing Wei
Pub Date: 2024-12-23 | DOI: 10.1109/TMM.2024.3521778 | IEEE Transactions on Multimedia, vol. 27, pp. 1277-1288
In object detection, labeling is very expensive because annotators must not only confirm the categories of the multiple objects in an image but also draw an accurate bounding box for each of them. Integrating active learning into object detection is therefore of great practical value. In this paper, we propose a classification-committee method for active deep object detection that introduces a multi-classifier discrepancy mechanism for sample selection when training object detectors. The model contains a main detector and a classification committee. The main detector is the target object detector, trained on a labeled pool composed of the selected informative images. The classification committee selects the most informative images according to their classification uncertainty, focusing on the discrepancy and representativeness of instances. Specifically, the committee, pre-trained with the proposed Maximum Classifiers Discrepancy Group Loss (MCDGL), computes the uncertainty of each instance in an image by measuring the discrepancy among the committee members' outputs. The most informative images are then those containing many high-uncertainty instances. Moreover, to mitigate the impact of interference instances, we design a Focusing on Positive Instances Loss (FPIL) that enables the committee to automatically focus on representative instances and to precisely encode their discrepancies. Experiments on the Pascal VOC and COCO datasets with several popular object detectors show that our method outperforms state-of-the-art active learning methods, verifying its effectiveness.
{"title":"Classification Committee for Active Deep Object Detection","authors":"Lei Zhao;Bo Li;Jixiang Jiang;Xingxing Wei","doi":"10.1109/TMM.2024.3521778","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521778","url":null,"abstract":"In object detection, the cost of labeling is very high because it needs not only to confirm the categories of multiple objects in an image but also to determine the bounding boxes of each object accurately. Thus, integrating active learning into object detection will raise pretty positive significance. In this paper, we propose a classification committee for the active deep object detection method by introducing a discrepancy mechanism of multiple classifiers for samples' selection when training object detectors. The model contains a main detector and a classification committee. The main detector denotes the target object detector trained from a labeled pool composed of the selected informative images. The role of the classification committee is to select the most informative images according to their uncertainty values from the view of classification, which is expected to focus more on the discrepancy and representative of instances. Specifically, they compute the uncertainty for a specified instance within the image by measuring its discrepancy output by the committee pre-trained via the proposed Maximum Classifiers Discrepancy Group Loss (MCDGL). The most informative images are finally determined by selecting the ones with many high-uncertainty instances. Besides, to mitigate the impact of interference instances, we design a Focusing on Positive Instances Loss (FPIL) to provide the committee the ability to automatically focus on the representative instances as well as precisely encode their discrepancies for the same instance. Experiments are conducted on Pascal VOC and COCO datasets versus some popular object detectors. And results show that our method outperforms the state-of-the-art active learning methods, which verifies the effectiveness of the proposed method.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1277-1288"},"PeriodicalIF":8.4,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143594345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dual-Path Deep Unsupervised Learning for Multi-Focus Image Fusion
Yuhui Quan; Xi Wan; Tianxiang Zheng; Yan Huang; Hui Ji
Pub Date: 2024-12-23 | DOI: 10.1109/TMM.2024.3521817 | IEEE Transactions on Multimedia, vol. 27, pp. 1165-1176
Multi-focus image fusion (MFIF) aims to merge multiple images captured at different focal lengths into a single all-in-focus image. This paper introduces a fully unsupervised learning approach for MFIF that uses only pairs of defocused images for end-to-end training, bypassing the need for the ground truths required by supervised learning. Unlike existing methods trained with a similarity loss between the fused and source images, we propose a dual-path learning framework comprising two networks: an image fuser and a mask predictor. The mask predictor is modeled as a self-supervised denoising network on imperfect fusion masks, trained with a masking-based unsupervised learning scheme. The image fuser, crafted with deep unrolling, leverages the output of the mask predictor to supervise its mask generation at each unrolled step. Moreover, we introduce a fusion consistency loss to ensure alignment between the image fuser and the mask predictor. In extensive experiments, the proposed approach outperforms existing end-to-end unsupervised methods and is competitive with supervised ones.
{"title":"Dual-Path Deep Unsupervised Learning for Multi-Focus Image Fusion","authors":"Yuhui Quan;Xi Wan;Tianxiang Zheng;Yan Huang;Hui Ji","doi":"10.1109/TMM.2024.3521817","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521817","url":null,"abstract":"Multi-focus image fusion (MFIF) aims at merging multiple images captured at different focal lengths to create an all-in-focus image. This paper introduces a fully unsupervised learning approach for MFIF that uses only pairs of defocused images for end-to-end training, bypassing the need for ground-truths in supervised learning. Unlike existing methods training via a similarity loss between fused and source images, we propose a dual-path learning framework comprising two networks: an image fuser and a mask predictor. The mask predictor is modeled as a self-supervised denoising network on imperfect fusion masks, trained with a masking-based unsupervised learning scheme. The image fuser, crafted with deep unrolling, leverages the output from the mask predictor to supervise its mask generation at each unrolled step. Moreover, we introduce a fusion consistency loss to ensure the alignment between the image fuser and the mask predictor. In extensive experiments, our proposed approach shows superiority over existing end-to-end unsupervised methods and competitive performance against the supervised ones.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1165-1176"},"PeriodicalIF":8.4,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143594279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Improving Network Interpretability via Explanation Consistency Evaluation
Hefeng Wu; Hao Jiang; Keze Wang; Ziyi Tang; Xianghuan He; Liang Lin
Pub Date: 2024-09-16 | DOI: 10.1109/TMM.2024.3453058 | IEEE Transactions on Multimedia, vol. 26, pp. 11261-11273
While deep neural networks have achieved remarkable performance, they tend to lack transparency in their predictions. The pursuit of greater interpretability in neural networks often degrades their original performance. Some works strive to improve both interpretability and performance, but they primarily depend on meticulously imposed conditions. In this paper, we propose a simple yet effective framework that yields more explainable activation heatmaps and simultaneously increases model performance, without any extra supervision. Specifically, our concise framework introduces a new metric, explanation consistency, to adaptively reweight the training samples during model learning. The explanation consistency metric measures the similarity between the model's visual explanations of the original samples and those of semantics-preserved adversarial samples, whose background regions are perturbed with image adversarial attack techniques. Our framework then promotes model learning by paying closer attention to those training samples with a large difference in explanations (i.e., low explanation consistency), for which the current model cannot provide robust interpretations. Comprehensive experimental results on various benchmarks demonstrate the superiority of our framework in multiple aspects, including higher recognition accuracy, greater data-debiasing capability, stronger network robustness, and more precise localization on both regular and interpretable networks. We also provide extensive ablation studies and qualitative analyses to unveil the contribution of each component.
{"title":"Improving Network Interpretability via Explanation Consistency Evaluation","authors":"Hefeng Wu;Hao Jiang;Keze Wang;Ziyi Tang;Xianghuan He;Liang Lin","doi":"10.1109/TMM.2024.3453058","DOIUrl":"https://doi.org/10.1109/TMM.2024.3453058","url":null,"abstract":"While deep neural networks have achieved remarkable performance, they tend to lack transparency in prediction. The pursuit of greater interpretability in neural networks often results in a degradation of their original performance. Some works strive to improve both interpretability and performance, but they primarily depend on meticulously imposed conditions. In this paper, we propose a simple yet effective framework that acquires more explainable activation heatmaps and simultaneously increases the model performance, without the need for any extra supervision. Specifically, our concise framework introduces a new metric, i.e., explanation consistency, to reweight the training samples adaptively in model learning. The explanation consistency metric is utilized to measure the similarity between the model's visual explanations of the original samples and those of semantic-preserved adversarial samples, whose background regions are perturbed by using image adversarial attack techniques. Our framework then promotes the model learning by paying closer attention to those training samples with a high difference in explanations (i.e., low explanation consistency), for which the current model cannot provide robust interpretations. Comprehensive experimental results on various benchmarks demonstrate the superiority of our framework in multiple aspects, including higher recognition accuracy, greater data debiasing capability, stronger network robustness, and more precise localization ability on both regular networks and interpretable networks. We also provide extensive ablation studies and qualitative analyses to unveil the detailed contribution of each component.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11261-11273"},"PeriodicalIF":8.4,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142691775","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep Mutual Distillation for Unsupervised Domain Adaptation Person Re-Identification
Xingyu Gao; Zhenyu Chen; Jianze Wei; Rubo Wang; Zhijun Zhao
Pub Date: 2024-09-12 | DOI: 10.1109/TMM.2024.3459637 | IEEE Transactions on Multimedia, vol. 27, pp. 1059-1071
Unsupervised domain adaptation person re-identification (UDA person re-ID) aims to transfer knowledge from a source domain with expensive manual annotation to an unlabeled target domain. Most recent papers leverage pseudo-labels for the target images to accomplish this task. However, noise in the generated labels hinders the identification system from learning discriminative features. To address this problem, we propose deep mutual distillation (DMD) to generate reliable pseudo-labels for UDA person re-ID. DMD applies two parallel branches for feature extraction, and each branch serves as the teacher of the other by generating pseudo-labels for its training. This mutually reinforcing optimization framework enhances the reliability of the pseudo-labels and thereby improves identification performance. In addition, we present a bilateral graph representation (BGR) to describe pedestrian images. BGR mimics how humans re-identify a person by aggregating identity features according to visual similarity and attribute consistency. Experimental results on Market-1501 and Duke demonstrate the effectiveness and generalization of the proposed method.
Collaborative License Plate Recognition via Association Enhancement Network With Auxiliary Learning and a Unified Benchmark
Yifei Deng; Guohao Wang; Chenglong Li; Wei Wang; Cheng Zhang; Jin Tang
Pub Date: 2024-09-10 | DOI: 10.1109/TMM.2024.3452982 | IEEE Transactions on Multimedia, vol. 26, pp. 11402-11414
Since the standard license plate of a large vehicle is easily affected by occlusion and stains, traffic management departments introduce an enlarged license plate at the rear of large vehicles to assist license plate recognition. However, current research treats standard license plate recognition and enlarged license plate recognition as independent tasks and does not exploit the complementary benefits of the two types of plates. In this work, we propose a new computer vision task called collaborative license plate recognition, which aims to leverage the complementary advantages of standard and enlarged license plates for more accurate recognition. To this end, we propose an Association Enhancement Network (AENet), which achieves robust collaborative license plate recognition by capturing the correlations between characters within a single license plate and enhancing the associations between the two plates. In particular, we design an association enhancement branch that supervises the fusion of the two license plate representations with the complete license plate number to mine the association between them. To enhance the representation of each type of plate, we design an auxiliary learning branch in the training stage that supervises the learning of individual license plates during association enhancement. In addition, we contribute a comprehensive benchmark dataset called CLPR for collaborative license plate recognition, which consists of 19,782 standard and enlarged license plates from 24 provinces in China and covers most challenges in real scenarios. Extensive experiments on the proposed CLPR dataset demonstrate the effectiveness of AENet against several state-of-the-art methods.
VLDadaptor: Domain Adaptive Object Detection With Vision-Language Model Distillation
Junjie Ke; Lihuo He; Bo Han; Jie Li; Di Wang; Xinbo Gao
Pub Date: 2024-09-06 | DOI: 10.1109/TMM.2024.3453061 | IEEE Transactions on Multimedia, vol. 26, pp. 11316-11331
Domain adaptive object detection (DAOD) aims to develop a detector trained on labeled source domains that can identify objects in unlabeled target domains. A primary challenge in DAOD is the domain shift problem. Most existing methods learn domain-invariant features within a single-domain embedding space, which often results in heavy model biases due to the intrinsic data properties of the source domains. To mitigate these biases, this paper proposes VLDadaptor, a domain adaptive object detector based on vision-language model (VLM) distillation. First, the proposed method integrates domain-mixed contrastive knowledge distillation between the visual encoder of CLIP and the detector by transferring category-level instance features, which guarantees that the detector extracts domain-invariant visual instance features across domains. Then, VLDadaptor employs domain-mixed consistency distillation between the text encoder of CLIP and the detector by aligning text prompt embeddings with visual instance features, which helps maintain category-level feature consistency among the detector and the text and visual encoders of the VLM. Finally, the proposed method further promotes adaptation by adopting a prompt-based memory bank to generate semantically complete features for graph matching. These contributions enable VLDadaptor to map visual features into the vision-language embedding space without any evident model bias towards specific domains. Extensive experimental results demonstrate that the proposed method achieves state-of-the-art performance on the Pascal VOC to Clipart adaptation task and exhibits high accuracy on driving-scenario tasks with significantly less training time.
Camera-Incremental Object Re-Identification With Identity Knowledge Evolution
Hantao Yao; Jifei Luo; Lu Yu; Changsheng Xu
Pub Date: 2024-09-05 | DOI: 10.1109/TMM.2024.3453045 | IEEE Transactions on Multimedia, vol. 26, pp. 11246-11260
Object re-identification (ReID) aims to retrieve a probe object from a multitude of gallery images using a ReID model trained on a stationary, camera-free dataset. This training involves associating and aggregating identities across various camera views. However, when deploying ReID algorithms in real-world scenarios, several challenges, such as storage constraints, privacy considerations, and dynamic changes in camera setups, can hinder their generalizability and practicality. To address these challenges, we introduce a novel ReID task called Camera-Incremental Object Re-identification (CIOR). In CIOR, we treat each camera's data as a separate source and continually optimize the ReID model as new data streams arrive from various cameras. By associating and consolidating the knowledge of common identities, we aim to enhance discrimination capability and mitigate catastrophic forgetting. We therefore propose a novel Identity Knowledge Evolution (IKE) framework for CIOR, consisting of Identity Knowledge Association (IKA), Identity Knowledge Distillation (IKD), and Identity Knowledge Update (IKU). IKA discovers common identities between the current and historical identities, facilitating the integration of previously acquired knowledge. IKD distills historical identity knowledge from the common identities, enabling rapid adaptation of the historical model to the current camera view. After each camera has been trained, IKU continually expands identity knowledge by combining the historical and current identity memories. Evaluations on Market-CL and Veri-CL show the effectiveness of Identity Knowledge Evolution for CIOR. Code: https://github.com/htyao89/Camera-Incremental-Object-ReID
{"title":"Camera-Incremental Object Re-Identification With Identity Knowledge Evolution","authors":"Hantao Yao;Jifei Luo;Lu Yu;Changsheng Xu","doi":"10.1109/TMM.2024.3453045","DOIUrl":"10.1109/TMM.2024.3453045","url":null,"abstract":"Object Re-identification (ReID) is a task focused on retrieving a probe object from a multitude of gallery images using a ReID model trained on a stationary, camera-free dataset. This training involves associating and aggregating identities across various camera views. However, when deploying ReID algorithms in real-world scenarios, several challenges, such as storage constraints, privacy considerations, and dynamic changes in camera setups, can hinder their generalizability and practicality. To address these challenges, we introduce a novel ReID task called Camera-Incremental Object Re-identification (CIOR). In CIOR, we treat each camera's data as a separate source and continually optimize the ReID model as new data streams come from various cameras. By associating and consolidating the knowledge of common identities, our aim is to enhance discrimination capabilities and mitigate the problem of catastrophic forgetting. Therefore, we propose a novel Identity Knowledge Evolution (IKE) framework for CIOR, consisting of Identity Knowledge Association (IKA), Identity Knowledge Distillation (IKD), and Identity Knowledge Update (IKU). IKA is proposed to discover common identities between the current identity and historical identities, facilitating the integration of previously acquired knowledge. IKD involves distilling historical identity knowledge from common identities, enabling rapid adaptation of the historical model to the current camera view. After each camera has been trained, IKU is applied to continually expand identity knowledge by combining historical and current identity memories. Market-CL and Veri-CL evaluations show the effectiveness of Identity Knowledge Evolution (IKE) for CIOR.Code: \u0000<uri>https://github.com/htyao89/Camera-Incremental-Object-ReID</uri>","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11246-11260"},"PeriodicalIF":8.4,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142178715","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dual-View Data Hallucination With Semantic Relation Guidance for Few-Shot Image Recognition
Hefeng Wu; Guangzhi Ye; Ziyang Zhou; Ling Tian; Qing Wang; Liang Lin
Pub Date: 2024-09-02 | DOI: 10.1109/TMM.2024.3453055 | IEEE Transactions on Multimedia, vol. 26, pp. 11302-11315
Learning to recognize novel concepts from just a few image samples is very challenging, as the learned model easily overfits the few available data and generalizes poorly. One promising but underexplored solution is to compensate for the novel classes by generating plausible samples. However, most existing works along this line exploit visual information only, so the generated data are easily distracted by challenging factors contained in the few available samples. Aware that the semantic information in the textual modality reflects human concepts, this work proposes a novel framework that exploits semantic relations to guide dual-view data hallucination for few-shot image recognition. The proposed framework generates more diverse and reasonable data samples for novel classes through effective information transfer from base classes. Specifically, an instance-view data hallucination module hallucinates each sample of a novel class to generate new data, employing local semantic-correlated attention and global semantic feature fusion derived from base classes. Meanwhile, a prototype-view data hallucination module exploits a semantic-aware measure to estimate the prototype of a novel class and its associated distribution from the few samples, which yields the prototype as a more stable sample and enables resampling a large number of samples. We conduct extensive experiments and comparisons with state-of-the-art methods on several popular few-shot benchmarks to verify the effectiveness of the proposed framework.
{"title":"Dual-View Data Hallucination With Semantic Relation Guidance for Few-Shot Image Recognition","authors":"Hefeng Wu;Guangzhi Ye;Ziyang Zhou;Ling Tian;Qing Wang;Liang Lin","doi":"10.1109/TMM.2024.3453055","DOIUrl":"10.1109/TMM.2024.3453055","url":null,"abstract":"Learning to recognize novel concepts from just a few image samples is very challenging as the learned model is easily overfitted on the few data and results in poor generalizability. One promising but underexplored solution is to compensate for the novel classes by generating plausible samples. However, most existing works of this line exploit visual information only, rendering the generated data easy to be distracted by some challenging factors contained in the few available samples. Being aware of the semantic information in the textual modality that reflects human concepts, this work proposes a novel framework that exploits semantic relations to guide dual-view data hallucination for few-shot image recognition. The proposed framework enables generating more diverse and reasonable data samples for novel classes through effective information transfer from base classes. Specifically, an instance-view data hallucination module hallucinates each sample of a novel class to generate new data by employing local semantic correlated attention and global semantic feature fusion derived from base classes. Meanwhile, a prototype-view data hallucination module exploits semantic-aware measure to estimate the prototype of a novel class and the associated distribution from the few samples, which thereby harvests the prototype as a more stable sample and enables resampling a large number of samples. We conduct extensive experiments and comparisons with state-of-the-art methods on several popular few-shot benchmarks to verify the effectiveness of the proposed framework.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11302-11315"},"PeriodicalIF":8.4,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142178587","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Face forgery detection has attracted much attention due to ever-increasing social concerns caused by facial manipulation techniques. Recently, identity-based detection methods have made considerable progress and are especially suitable for the celebrity-protection scenario. However, they still suffer from two main limitations: (a) a generic identity extractor is not specifically designed for forgery detection, leading to nonnegligible Identity Representation Bias