IEEE Transactions on Multimedia: Latest Articles, Page 5

Scene Text Image Super-Resolution Via Semantic Distillation and Text Perceptual Loss

IF 8.4 Zone 1 (Computer Science) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Multimedia

Pub Date : 2024-12-24 DOI: 10.1109/TMM.2024.3521759

Cairong Zhao;Rui Shu;Shuyang Feng;Liang Zhu;Xuekuan Wang

Text super-resolution (SR) aims to recover information lost in low-resolution text images. Since the introduction of TextZoom, the first dataset for text super-resolution in real scenes, a growing number of scene text super-resolution models have been built on it. Although these methods achieve excellent performance, they do not consider how to make full and efficient use of semantic information. Motivated by this, a Semantic-aware Trident Network (STNet) for scene text image super-resolution is proposed. Specifically, the pre-trained text recognition model ASTER (Attentional Scene Text Recognizer) is used to assist this process in two ways. First, a novel basic block named Semantic-aware Trident Block (STB) is designed to build STNet; it incorporates an added branch for semantic distillation to learn the semantic information of the pre-trained recognition model. Second, we extend the model with adversarial training and propose a new text perceptual loss based on ASTER to further enhance semantic information in SR images. Extensive experiments on the TextZoom dataset show that, compared with directly recognizing bicubic images, the proposed STNet boosts the recognition accuracy of ASTER, MORAN (Multi-Object Rectified Attention Network), and CRNN (Convolutional Recurrent Neural Network) by 17.4%, 18.2%, and 24.3%, respectively, which is higher than that of several existing state-of-the-art (SOTA) SR network models. In addition, experiments in real scenes (on the ICDAR 2015 dataset) and in restricted scenarios (defense against adversarial attacks) validate that adding semantic information enables the proposed method to achieve promising cross-dataset performance. Since the proposed method is trained on cropped images, when it is applied to real-world scenarios, text in natural images is first localized with scene text detection methods, and cropped text images are then obtained from the detected text positions.

Volume 27, pages 1153-1164

Citations: 0

Combating Noisy Labels by Alleviating the Memorization of DNNs to Noisy Labels

IF 8.4 Zone 1 (Computer Science) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Multimedia

Pub Date : 2024-12-24 DOI: 10.1109/TMM.2024.3521722

Shunjie Yuan;Xinghua Li;Yinbin Miao;Haiyan Zhang;Ximeng Liu;Robert H. Deng

Data is the essential fuel for deep neural networks (DNNs), and its quality affects their practical performance. In real-world training scenarios, the generalization performance of DNNs is severely challenged by noisy samples with incorrect labels. To combat noisy samples in image classification, numerous methods based on sample selection and semi-supervised learning (SSL) have been developed, in which sample selection provides the supervision signal for SSL, and these methods have been highly successful in resisting noisy samples. However, because of the necessary warm-up training on noisy datasets and the basic sample selection mechanism, DNNs still face the challenge of memorizing noisy samples. Existing methods do not explicitly address this memorization, which hinders the generalization performance of DNNs. To alleviate this issue, we present a new approach to combat noisy samples. First, we propose a memorized noise detection method to detect noisy samples that DNNs have already memorized during training. Next, we design a noise-excluded sample selection method and a noise-alleviated MixMatch to reduce DNNs' memorization of noisy samples. Finally, we integrate our approach with the established method DivideMix, proposing Modified-DivideMix. Experimental results on CIFAR-10, CIFAR-100, and Clothing1M demonstrate the effectiveness of our approach.

Volume 27, pages 597-609

Citations: 0

Prototype Alignment With Dedicated Experts for Test-Agnostic Long-Tailed Recognition

IF 8.4 Zone 1 (Computer Science) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Multimedia

Pub Date : 2024-12-24 DOI: 10.1109/TMM.2024.3521665

Chen Guo;Weiling Chen;Aiping Huang;Tiesong Zhao

Unlike vanilla long-tailed recognition, which trains on imbalanced data but assumes a uniform test class distribution, test-agnostic long-tailed recognition aims to handle arbitrary test class distributions. Existing methods require prior knowledge of the test set for post-adjustment through multi-stage training, resulting in static decisions at the dataset level. This pipeline overlooks instance diversity and is impractical in real situations. In this work, we introduce Prototype Alignment with Dedicated Experts (PADE), a one-stage framework for test-agnostic long-tailed recognition. PADE tackles unknown test distributions at the instance level, without depending on test priors. It reformulates the task as a domain detection problem, dynamically adjusting the model for each instance. PADE comprises three main strategies: 1) a parameter customization strategy for multiple experts skilled at different categories; 2) normalized target knowledge distillation for mutual guidance among experts while maintaining diversity; 3) re-balanced compactness learning with momentum prototypes, promoting alignment of each instance with its corresponding class centroid. We evaluate PADE on various long-tailed recognition benchmarks with diverse test distributions. The results verify its effectiveness in both vanilla and test-agnostic long-tailed recognition.
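The momentum-prototype idea in strategy 3 can be illustrated with a small sketch. The abstract does not give an update rule, so the exponential-moving-average update below, the `momentum` value, and every function name are assumptions made for illustration; this is not PADE's actual implementation.

```python
from typing import Dict, List

Vector = List[float]


def update_prototype(prototype: Vector, feature: Vector, momentum: float = 0.9) -> Vector:
    """Move a class prototype toward a new feature with an exponential moving average (assumed rule)."""
    return [momentum * p + (1.0 - momentum) * f for p, f in zip(prototype, feature)]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance, used here as a simple compactness measure."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_class(prototypes: Dict[int, Vector], feature: Vector) -> int:
    """Return the class whose prototype is closest to the feature."""
    return min(prototypes, key=lambda c: squared_distance(prototypes[c], feature))


# Hypothetical 2-D features for two classes.
prototypes = {0: [0.0, 0.0], 1: [4.0, 4.0]}
# A new class-1 sample pulls the class-1 prototype slightly toward itself.
prototypes[1] = update_prototype(prototypes[1], [5.0, 3.0], momentum=0.9)
```

A high momentum keeps each centroid stable when a class has few samples, which is one plausible reason to prefer a moving average over recomputing centroids per batch on long-tailed data.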


Volume 27, pages 455-465

Citations: 0

Content-Aware Tunable Selective Encryption for HEVC Using Sine-Modular Chaotification Model

IF 8.4 Zone 1 (Computer Science) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Multimedia

Pub Date : 2024-12-24 DOI: 10.1109/TMM.2024.3521724

Qingxin Sheng;Chong Fu;Zhaonan Lin;Junxin Chen;Xingwei Wang;Chiu-Wing Sham

Existing High Efficiency Video Coding (HEVC) selective encryption algorithms consider only the encoding characteristics of syntax elements to maintain format compliance and ignore the semantic features of video content, which may lead to unnecessary computational and bit rate costs. To tackle this problem, we present a content-aware tunable selective encryption (CATSE) scheme for HEVC. First, a deep hashing network is adopted to retrieve groups of pictures (GOPs) containing sensitive objects. Then, the retrieved sensitive GOPs and the remaining insensitive ones are encrypted with different encryption strengths. For the former, multiple syntax elements are encrypted to ensure security, whereas for the latter, only a few bypass-coded syntax elements are encrypted to improve encryption efficiency and reduce the bit rate overhead. The keystream sequence is extracted from the time series of a new, improved logistic map with complex dynamic behavior, which is generated by our proposed sine-modular chaotification model. Finally, reversible steganography is applied to embed the flag bits of the GOP type into the encrypted bitstream, so that the decoder can distinguish the encrypted syntax elements that need to be decrypted in different GOPs. Experimental results indicate that the proposed HEVC CATSE scheme not only provides high encryption speed and low bit rate overhead, but also offers greater encryption strength than other state-of-the-art HEVC selective encryption algorithms.
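To make the keystream step concrete, the sketch below derives bytes from a chaotic time series and XORs them with data. It uses the classic logistic map, not the improved map produced by the authors' sine-modular chaotification model (which the abstract does not specify); the parameter `r`, the `burn_in` length, the byte quantization, and all names are assumptions. It is a toy illustration and not a secure cipher.

```python
from typing import Iterator


def logistic_keystream(x0: float, r: float = 3.99, burn_in: int = 100) -> Iterator[int]:
    """Yield pseudo-random bytes from the classic logistic map x -> r * x * (1 - x)."""
    x = x0
    for _ in range(burn_in):  # discard the initial transient
        x = r * x * (1.0 - x)
    while True:
        x = r * x * (1.0 - x)
        yield int(x * 256) % 256  # quantize the state to one byte (assumed scheme)


def xor_with_keystream(data: bytes, x0: float) -> bytes:
    """XOR data with the keystream; applying it twice with the same x0 restores the input."""
    stream = logistic_keystream(x0)
    return bytes(b ^ next(stream) for b in data)


# Hypothetical payload standing in for the bins of selected syntax elements.
payload = b"bypass-coded bins"
encrypted = xor_with_keystream(payload, x0=0.3141)
decrypted = xor_with_keystream(encrypted, x0=0.3141)
```

Because XOR with a keystream preserves length, this style of encryption fits the goal of keeping bit rate overhead low; the initial value `x0` plays the role of the secret key here.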


Volume 27, pages 41-55

Citations: 0

Part-Level Relationship Learning for Fine-Grained Few-Shot Image Classification

IF 8.4 Zone 1 (Computer Science) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Multimedia

Pub Date : 2024-12-24 DOI: 10.1109/TMM.2024.3521792

Chuanming Wang;Huiyuan Fu;Peiye Liu;Huadong Ma

Recently, an increasing number of few-shot image classification methods have been proposed, which seek a learning paradigm for training a high-performance classification model with limited labeled samples. However, because they neglect part-level relationships, few-shot methods struggle to distinguish between closely similar subcategories, which makes it difficult for them to solve the fine-grained image classification problem. To tackle this challenging task, this paper proposes a fine-grained few-shot image classification method that exploits both intra-part and inter-part relationships among different samples. To establish comprehensive relationships, we first extract multiple discriminative descriptors from the input image, representing its different parts. Then, we define the metric spaces by interpolating intra-part relationships, which helps the model adaptively find clear boundaries for these confusing classes. Finally, since the unlabeled image has high similarities to all classes, we project these similarities into a high-dimensional space according to the inter-part relationship and interpolate a parameterized classifier to discover the subtle differences among these similar classes. To evaluate our proposed method, we conduct extensive experiments on various fine-grained datasets. Without any pre-training or fine-tuning, our approach clearly outperforms previous few-shot learning methods, which demonstrates its effectiveness.

Volume 27, pages 1448-1460

Citations: 0

Underwater Image Enhancement With Cascaded Contrastive Learning

IF 8.4 Zone 1 (Computer Science) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Multimedia

Pub Date : 2024-12-24 DOI: 10.1109/TMM.2024.3521739

Yi Liu;Qiuping Jiang;Xinyi Wang;Ting Luo;Jingchun Zhou

Underwater image enhancement (UIE) is a highly challenging task due to the complexity of the underwater environment and the diversity of underwater image degradation. Thanks to deep learning, current UIE methods have made significant progress. However, most existing deep learning-based UIE methods use a single-stage network, which cannot effectively address the diverse degradations simultaneously. In this paper, we address this issue by designing a two-stage deep learning framework and using cascaded contrastive learning to guide the network training of each stage. The proposed method is called CCL-Net for short. Specifically, CCL-Net involves two cascaded stages: a color correction stage tailored to the color deviation issue, and a haze removal stage tailored to improving the visibility and contrast of underwater images. To guarantee that the underwater image is progressively enhanced, we also apply a contrastive loss as an additional constraint to guide the training of each stage. In the first stage, the raw underwater images are used as negative samples for building the first contrastive loss, ensuring that the enhanced results of the color correction stage are better than the original inputs. In the second stage, the enhanced results of the color correction stage, rather than the raw underwater images, are used as negative samples for building the second contrastive loss, ensuring that the final results of the haze removal stage are better than the intermediate color-corrected results. Extensive experiments on multiple benchmark datasets demonstrate that CCL-Net achieves superior performance compared to many state-of-the-art methods. In addition, a series of ablation studies verifies the effectiveness of each key component of CCL-Net.
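The choice of negatives in the two stages can be sketched as follows. The abstract does not state the form of the contrastive loss or what serves as the positive sample, so the distance-ratio loss, the use of a reference image as the positive, the L1 distance on flat feature vectors, and all names below are assumptions for illustration only.

```python
from typing import List

Vector = List[float]


def l1_distance(a: Vector, b: Vector) -> float:
    """Sum of absolute differences between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def contrastive_loss(anchor: Vector, positive: Vector, negative: Vector, eps: float = 1e-7) -> float:
    """Small when the anchor is close to the positive and far from the negative (assumed ratio form)."""
    return l1_distance(anchor, positive) / (l1_distance(anchor, negative) + eps)


def cascaded_contrastive_loss(raw: Vector, stage1_out: Vector, stage2_out: Vector, reference: Vector) -> float:
    """Stage 1 uses the raw input as the negative; stage 2 uses the stage-1 output as the negative."""
    loss_stage1 = contrastive_loss(stage1_out, reference, negative=raw)
    loss_stage2 = contrastive_loss(stage2_out, reference, negative=stage1_out)
    return loss_stage1 + loss_stage2


# Hypothetical 3-value "features" for one image at each point in the pipeline.
raw = [0.2, 0.5, 0.9]
stage1_out = [0.4, 0.5, 0.7]   # after color correction
stage2_out = [0.5, 0.5, 0.6]   # after haze removal
reference = [0.5, 0.5, 0.5]    # assumed positive sample
total = cascaded_contrastive_loss(raw, stage1_out, stage2_out, reference)
```

Using the stage-1 output as the second negative pushes the final result away from the intermediate image as well as toward the reference, which matches the stated goal that each stage improve on the one before it.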

Volume 27, pages 1512-1525

Citations: 0

Discriminative Anchor Learning for Efficient Multi-View Clustering

IF 8.4 Zone 1 (Computer Science) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Multimedia

Pub Date : 2024-12-24 DOI: 10.1109/TMM.2024.3521743

Yalan Qin;Nan Pu;Hanzhou Wu;Nicu Sebe

Multi-view clustering aims to exploit the complementary information across views and discover the underlying structure. To reduce the relatively high computational cost of existing approaches, anchor-based methods have recently been presented. Although they achieve acceptable clustering performance, these methods tend to map the original representations from multiple views into a fixed shared graph based on the original dataset. Most studies ignore the discriminative property of the learned anchors, which harms the representation capability of the resulting model. Moreover, simply learning the shared anchor graph without considering the quality of view-specific anchors fails to preserve the complementary information among anchors across views. In this paper, we propose discriminative anchor learning for multi-view clustering (DALMC) to handle these issues. We learn discriminative view-specific feature representations from the original dataset and build anchors from different views based on these representations, which increases the quality of the shared anchor graph. Discriminative feature learning and consensus anchor graph construction are integrated into a unified framework so that each refines the other. The optimal anchors from multiple views and the consensus anchor graph are learned under orthogonal constraints. We give an iterative algorithm to solve the formulated problem. Extensive experiments on different datasets show the effectiveness and efficiency of our method compared with other methods.

Volume 27, pages 1386-1396

Citations: 0

HNR-ISC: Hybrid Neural Representation for Image Set Compression

IF 8.4 Zone 1 (Computer Science) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Multimedia

Pub Date : 2024-12-24 DOI: 10.1109/TMM.2024.3521715

Pingping Zhang;Shiqi Wang;Meng Wang;Peilin Chen;Wenhui Wu;Xu Wang;Sam Kwong

Image set compression (ISC) refers to compressing sets of semantically similar images. Traditional ISC methods typically aim to eliminate redundancy among images in either the signal or the frequency domain, but often struggle to handle complex geometric deformations across different images effectively. Here, we propose a new Hybrid Neural Representation for ISC (HNR-ISC), including an implicit neural representation for Semantically Common content Compression (SCC) and an explicit neural representation for Semantically Unique content Compression (SUC). Specifically, SCC converts semantically common content into a small-and-sweet neural representation, along with embeddings that can be conveyed as a bitstream. SUC is composed of invertible modules for removing intra-image redundancies. The feature-level combination of SCC and SUC naturally forms the final image set. Experimental results demonstrate the robustness and generalization capability of HNR-ISC in terms of signal and perceptual quality for reconstruction and accuracy for the downstream analysis task.

Citations: 0

Knowledge-Guided Cross-Modal Alignment and Progressive Fusion for Chest X-Ray Report Generation

IF 8.4 Zone 1 Computer Science Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Multimedia

Pub Date : 2024-12-24 DOI: 10.1109/TMM.2024.3521728

Lili Huang;Yiming Cao;Pengcheng Jia;Chenglong Li;Jin Tang;Chuanfu Li

The task of chest X-ray report generation, which aims to simulate the diagnostic process of doctors, has received widespread attention. Compared with image captioning, chest X-ray report generation is more challenging because it must produce a longer and more accurate description of each diagnostic region in a chest X-ray image. Most existing works focus on extracting better visual features or on producing more accurate text based on existing reports. However, they ignore the interactions between the visual and text modalities and are thus not in line with human reasoning. A few works explore the interactions between the visual and text modalities, but purely data-driven learning of cross-modal mappings cannot bridge the semantic gap between the modalities. In this work, we propose a novel approach called Knowledge-guided Cross-modal Alignment and Progressive fusion (KCAP), which uses knowledge words from a purpose-built medical knowledge dictionary as a bridge to guide cross-modal feature alignment and fusion for accurate chest X-ray report generation. In particular, we create the medical knowledge dictionary by extracting medical phrases from the training set and then selecting phrases with substantive meanings as knowledge words based on their frequency of occurrence. Guided by these knowledge words, the visual and text modalities interact through a mapping layer that enhances the features of both modalities, and an alignment fusion module is then introduced to mitigate the semantic gap between them. To retain the important details of the original information, we design a progressive fusion scheme that integrates the advantages of both the salient fused features and the original features to generate better medical reports. Experimental results on the IU-Xray and MIMIC datasets demonstrate the effectiveness of the proposed KCAP.
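
The dictionary-building step described in the abstract (extract phrases from the training reports, then keep frequent, substantive ones) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `build_knowledge_dictionary`, the stop-word list, the n-gram length, and the `min_count` threshold are all assumptions made for illustration, and the abstract does not say how "substantive" phrases are actually identified.

```python
from collections import Counter

# Hypothetical stop-word list used as a stand-in for "substantive meaning";
# the abstract does not specify the real selection rule.
STOP_WORDS = {"the", "is", "are", "of", "no", "a", "an", "and", "in", "with", "there"}


def build_knowledge_dictionary(reports, min_count=2, max_n=2):
    """Count word n-grams over all reports and keep the frequent ones.

    An n-gram is skipped if any of its tokens is a stop word.
    `min_count` and `max_n` are illustrative assumptions.
    """
    counts = Counter()
    for report in reports:
        # Crude tokenization: lowercase and strip periods and commas.
        tokens = report.lower().replace(".", " ").replace(",", " ").split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                if any(tok in STOP_WORDS for tok in gram):
                    continue
                counts[" ".join(gram)] += 1
    return {phrase for phrase, count in counts.items() if count >= min_count}


# Toy "training set" of report sentences (invented for this example).
reports = [
    "The heart is normal in size. No pleural effusion.",
    "There is a small pleural effusion. The heart is enlarged.",
    "No pneumothorax. Pleural effusion is absent.",
]
knowledge_words = build_knowledge_dictionary(reports)
```

On this toy input only terms that occur in at least two reports survive, such as "pleural effusion" and "heart". A real pipeline would likely use a medical phrase extractor rather than raw n-grams.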

Vol. 27, pp. 557-567. Publication type: Journal Article. Open access: no.

Citations: 0

GCCNet: A Novel Network Leveraging Gated Cross-Correlation for Multi-View Classification

IF 8.4 Zone 1 Computer Science Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Multimedia

Pub Date : 2024-12-24 DOI: 10.1109/TMM.2024.3521733

Yuanpeng Zeng;Ru Zhang;Hao Zhang;Shaojie Qiao;Faliang Huang;Qing Tian;Yuzhong Peng

Multi-view learning is a machine learning paradigm that uses multiple feature sets or data sources to improve learning performance and generalization. However, existing multi-view learning methods often fail to capture and exploit information from different views effectively, especially when the relationships between views are complex and the views vary in quality. In this paper, we propose a novel multi-view learning framework for the multi-view classification task, called Gated Cross-Correlation Network (GCCNet), which addresses these challenges by integrating the three key operational levels of multi-view learning: representation, fusion, and decision. Specifically, GCCNet contains a novel component called the Multi-View Gated Information Distributor (MVGID) to enhance noise filtering and optimize the retention of critical information. In addition, GCCNet uses cross-correlation analysis to reveal dependencies and interactions between different views, and integrates an adaptive weighted joint decision strategy to mitigate the interference of low-quality views. Thus, GCCNet can not only comprehensively capture and utilize information from different views, but also facilitate information exchange and synergy between views, ultimately improving the overall performance of the model. Extensive experimental results on ten benchmark datasets show that GCCNet outperforms state-of-the-art methods on eight of the ten datasets, validating its effectiveness and superiority in multi-view learning.
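
To make the idea of an adaptive weighted joint decision concrete, here is a minimal sketch of one generic way to down-weight low-quality views at decision time. It is not GCCNet's actual strategy, which the abstract does not detail. The assumption here is that each view produces class logits, that a view's confidence is measured by the negative entropy of its predicted distribution, and that view weights are a softmax over those confidences. The names `fuse_views`, `softmax`, and `entropy` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(p):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(q * math.log(q) for q in p if q > 0)


def fuse_views(view_logits):
    """Fuse per-view class logits into one distribution.

    Assumption: a low-entropy (confident) view gets a larger weight.
    Returns the fused class probabilities and the per-view weights.
    """
    probs = [softmax(logits) for logits in view_logits]
    weights = softmax([-entropy(p) for p in probs])
    n_classes = len(probs[0])
    fused = [
        sum(w * p[c] for w, p in zip(weights, probs))
        for c in range(n_classes)
    ]
    return fused, weights


# One confident view and one uninformative (uniform) view, invented for the example.
view_logits = [[4.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
fused, weights = fuse_views(view_logits)
predicted_class = fused.index(max(fused))
```

In this toy case the uniform view receives the smaller weight, so the fused prediction follows the confident view. Entropy is only one possible quality signal; a learned gating network would be another.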

Vol. 27, pp. 1086-1099. Publication type: Journal Article. Open access: no.

Citations: 0