
Latest Publications in IEEE Transactions on Multimedia

Towards Efficient Partially Relevant Video Retrieval With Active Moment Discovering
IF 9.7 · CAS Tier 1 (Computer Science) · Q1 (Computer Science, Information Systems) · Pub Date: 2025-07-21 · DOI: 10.1109/TMM.2025.3590937
Peipei Song;Long Zhang;Long Lan;Weidong Chen;Dan Guo;Xun Yang;Meng Wang
Partially relevant video retrieval (PRVR) is a practical yet challenging task in text-to-video retrieval, where videos are untrimmed and contain much background content. The goal is a solution that is both effective and efficient at capturing the partial correspondence between text queries and untrimmed videos. However, existing PRVR methods, which typically focus on modeling multi-scale clip representations, suffer from content independence and information redundancy, which impair retrieval performance. To overcome these limitations, we propose a simple yet effective approach based on active moment discovering (AMDNet), which discovers the video moments that are semantically consistent with a given query. By using learnable span anchors to capture distinct moments and applying masked multi-moment attention to emphasize salient moments while suppressing redundant background, we obtain more compact and informative video representations. To further enhance moment modeling, we introduce a moment diversity loss that encourages different moments to cover distinct regions and a moment relevance loss that promotes semantically query-relevant moments; both cooperate with a partially relevant retrieval loss for end-to-end optimization. Extensive experiments on two large-scale video datasets (TVR and ActivityNet Captions) demonstrate the superiority and efficiency of AMDNet. In particular, on TVR, AMDNet is about 15.5 times smaller (in parameters) and 6.0 points higher (in SumR) than the recent method GMMFormer.
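The abstract does not come with code, but the interaction of learnable span anchors and masked multi-moment attention can be sketched in a few lines of PyTorch. The module below is an illustrative reading of that idea only: the class name, the (center, width) anchor parameterization, and all shapes are assumptions for this sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedMomentPooling(nn.Module):
    """Sketch: pool frame features into M moment vectors using learnable span anchors.

    Each anchor predicts a (center, width) in [0, 1]; frames outside the span are
    masked out of the attention, so redundant background contributes little.
    """
    def __init__(self, dim: int, num_moments: int = 4):
        super().__init__()
        self.anchors = nn.Parameter(torch.rand(num_moments, 2))  # (center, width) logits
        self.query = nn.Parameter(torch.randn(num_moments, dim)) # one query per moment
        self.scale = dim ** -0.5

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (T, dim) features of one untrimmed video
        T, _ = frames.shape
        t = torch.linspace(0.0, 1.0, T, device=frames.device)               # frame positions
        center, width = torch.sigmoid(self.anchors).unbind(dim=-1)          # (M,), (M,)
        inside = (t[None, :] - center[:, None]).abs() <= width[:, None] / 2 # (M, T) span mask
        attn = (self.query @ frames.t()) * self.scale                       # (M, T) logits
        attn = attn.masked_fill(~inside, float('-inf'))
        # guard against empty spans: fall back to uniform attention for that moment
        attn = torch.where(inside.any(-1, keepdim=True), attn, torch.zeros_like(attn))
        weights = F.softmax(attn, dim=-1)                                   # (M, T)
        return weights @ frames                                             # (M, dim) moment features

moments = MaskedMomentPooling(dim=512)(torch.randn(128, 512))
print(moments.shape)  # torch.Size([4, 512])
```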
Citations: 0
A Cross-Modal Generation Algorithm for Temporal Force Tactile Data for Multidimensional Haptic Rendering
IF 9.7 · CAS Tier 1 (Computer Science) · Q1 (Computer Science, Information Systems) · Pub Date: 2025-07-21 · DOI: 10.1109/TMM.2025.3590907
Rui Song;Guohong Liu;Yan Zhang;Xiaoying Sun
Exploiting the correlation between multimodal data to generate tactile data has become a preferred approach to enhancing tactile rendering fidelity. Nevertheless, existing studies often overlook the temporal dynamics of force tactile data. To fill this gap, this paper introduces a joint visual-audio algorithm for generating temporal tactile data (VA2T), which focuses on the temporal and long-term dependencies of force tactile data. VA2T uses a feature extraction network to extract audio and image features and then fuses these features with an attention mechanism and a decoder. The tactile reconstructor generates temporal friction and normal force, with dilated causal convolution preserving the temporal dependencies in the force tactile data. Simulation experiments on the LMT dataset demonstrate that, compared with the transformer and audio-visual-aided haptic signal reconstruction (AVHR) algorithms, the VA2T algorithm reduces the RMSE of generated friction by 29.44% and 32.37%, respectively, and of normal forces by 23.30% and 35.43%, respectively. In addition, we developed a haptic rendering approach that combines electrovibration and mechanical vibration to render the generated friction and normal force. Subjective experimental results show that the rendering fidelity of the data generated by the VA2T method is significantly higher than that of the data generated by the transformer and AVHR methods.
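As background for the dilated causal convolution mentioned above, the following generic PyTorch sketch shows a causal dilated 1-D convolution: the input is padded on the left only, so the output at time t never depends on future samples. Channel count, kernel size, and dilation are illustrative defaults, not the VA2T configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedConv1d(nn.Module):
    """Generic dilated causal convolution: output[t] depends only on input[<= t]."""
    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 2):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation        # amount of left-only padding
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); pad on the left so the kernel never sees the future
        x = F.pad(x, (self.pad, 0))
        return self.conv(x)

# toy temporal force signal: 2 sequences, 16 channels, 100 time steps
y = CausalDilatedConv1d(channels=16)(torch.randn(2, 16, 100))
print(y.shape)  # torch.Size([2, 16, 100])
```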
Citations: 0
Spectral Discrepancy and Cross-Modal Semantic Consistency Learning for Object Detection in Hyperspectral Images
IF 9.7 · CAS Tier 1 (Computer Science) · Q1 (Computer Science, Information Systems) · Pub Date: 2025-07-11 · DOI: 10.1109/TMM.2025.3586155
Xiao He;Chang Tang;Xinwang Liu;Wei Zhang;Zhimin Gao;Chuankun Li;Shaohua Qiu;Jiangfeng Xu
Hyperspectral images with high spectral resolution provide new insights into recognizing subtle differences between similar substances. However, object detection in hyperspectral images faces significant intra- and inter-class similarity challenges due to spatial differences between hyperspectral bands and unavoidable interference, e.g., sensor noise and illumination. To alleviate inter-band inconsistency and redundancy, we propose a novel network termed Spectral Discrepancy and Cross-Modal semantic consistency learning (SDCM), which extracts consistent information across a wide range of hyperspectral bands while exploiting the spectral dimension to pinpoint regions of interest. Specifically, we leverage a semantic consistency learning (SCL) module that utilizes inter-band contextual cues to diminish the heterogeneity of information among bands, yielding highly coherent spectral representations. In addition, we incorporate a spectral gated generator (SGG) into the framework, which filters out redundant data inherent in hyperspectral information based on the importance of each band. Finally, we design a spectral discrepancy aware (SDA) module that enriches the semantic representation of high-level information by extracting pixel-level spectral features. Extensive experiments on two hyperspectral datasets demonstrate that the proposed method achieves state-of-the-art performance compared with existing approaches.
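One minimal way to express the band-wise gating idea described above is to score each spectral band and rescale the hyperspectral cube by those scores. The sketch below is an assumption-laden illustration (the scoring MLP, the global-average band descriptor, and the soft gating rule are choices made here), not the SGG design from the paper.

```python
import torch
import torch.nn as nn

class BandGate(nn.Module):
    """Sketch: score each spectral band and gate the hyperspectral cube band-wise."""
    def __init__(self, num_bands: int):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(num_bands, num_bands), nn.ReLU(),
            nn.Linear(num_bands, num_bands), nn.Sigmoid(),  # per-band importance in (0, 1)
        )

    def forward(self, cube: torch.Tensor) -> torch.Tensor:
        # cube: (batch, bands, height, width)
        band_stats = cube.mean(dim=(2, 3))        # (batch, bands) global descriptor per band
        gates = self.score(band_stats)            # (batch, bands) importance weights
        return cube * gates[:, :, None, None]     # redundant bands are suppressed

gated = BandGate(num_bands=64)(torch.randn(2, 64, 32, 32))
print(gated.shape)  # torch.Size([2, 64, 32, 32])
```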
Citations: 0
Joint Distribution Weighted Alignment for Multi-Source Domain Adaptation via Kernel Relative Entropy Estimation
IF 9.7 · CAS Tier 1 (Computer Science) · Q1 (Computer Science, Information Systems) · Pub Date: 2025-07-10 · DOI: 10.1109/TMM.2025.3586109
Sentao Chen;Ping Xuan;Zhifeng Hao
The objective of Multi-Source Domain Adaptation (MSDA) is to train a neural network on labeled data from multiple joint source distributions (source domains) and unlabeled data from a joint target distribution (target domain), and to use the trained network to estimate the target data labels. The challenge in MSDA is that the multiple joint source distributions are related to, yet distinct from, the joint target distribution. To address this challenge, we propose a Joint Distribution Weighted Alignment (JDWA) approach that aligns a weighted joint source distribution to the joint target distribution under the relative entropy. Specifically, the weighted joint source distribution is defined as the weighted sum of the multiple joint source distributions and is parameterized by the relevance weights. Since the relative entropy is unknown in practice, we propose a Kernel Relative Entropy Estimation (KREE) method to estimate it from data. KREE first reformulates the relative entropy as the negative of the minimal value of a functional, then takes a function from the Reproducing Kernel Hilbert Space (RKHS) as the functional's input, and finally solves the resulting convex problem to a globally optimal solution. We also incorporate entropy regularization to enhance the network's performance. Together, we minimize cross entropy, relative entropy, and entropy to learn both the relevance weights and the neural network. Experimental results on benchmark image classification datasets demonstrate that JDWA performs better than the comparison methods. An intro video and PyTorch code are available at https://github.com/sentaochen/Joint-Distribution-Weighted-Alignment; further source code for domain adaptation, partial domain adaptation, multi-source domain adaptation, and domain generalization is available at https://github.com/sentaochen.
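In symbols, the weighted joint source distribution described above is a mixture of the K source-domain distributions, and the alignment objective minimizes its relative entropy to the target distribution over the relevance weights on the simplex. The notation below (p_k for the k-th source, p_t for the target, w for the relevance weights, theta for the network parameters) is chosen here for illustration; the abstract does not fix the direction of the KL divergence, so the orientation shown is an assumption.

```latex
% Mixture form of the weighted joint source distribution and the alignment objective.
p_{\boldsymbol{w}}(x, y) = \sum_{k=1}^{K} w_{k}\, p_{k}(x, y),
\qquad w_{k} \ge 0, \quad \sum_{k=1}^{K} w_{k} = 1,
\qquad
\min_{\boldsymbol{w},\, \theta}\; D_{\mathrm{KL}}\!\left( p_{\boldsymbol{w}} \,\middle\|\, p_{t} \right).
```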
Citations: 0
A Blockchain and Improved Perception Hash Based Copyright Protection Scheme for Purely Chromatic Background Images
IF 9.7 · CAS Tier 1 (Computer Science) · Q1 (Computer Science, Information Systems) · Pub Date: 2025-07-10 · DOI: 10.1109/TMM.2025.3586150
Guangyong Gao;Tongchao Feng;Chongtao Guo;Zhihua Xia;Yun-Qing Shi
Purely chromatic background images are widely used in computer wallpapers and advertisements, which raises issues such as copyright infringement and harms the interests of copyright holders. Image hashing is a technique for comparing the similarity between images and is often used for image verification, search, and copy detection because it is insensitive to subtle changes in the original image. In a purely chromatic background image, the central detail is the primary content and the key to copyright authentication. Because the perception hash (pHash) algorithm retains only the low-frequency portion of the discrete cosine transform (DCT) matrix, it is unsuitable for purely chromatic background images. To address this issue, we propose an improved perception hash (ipHash) algorithm that enhances the universality of the algorithm by extracting purely chromatic background image features. Meanwhile, the development of image hashing is restricted by the requirement of a trusted third party. To solve this issue, a secure blockchain-based image copyright protection scheme is designed. It realizes copyright authentication and traceability and overcomes the lack of trusted third parties. Experimental results show that the proposed method outperforms state-of-the-art image copyright protection schemes.
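For context, the classical pHash baseline that the paper improves on resizes the image, applies a 2-D DCT, and binarizes the low-frequency block against its median. The sketch below follows that standard recipe (variants differ in details such as excluding the DC coefficient); it is the baseline, not the proposed ipHash. The usage example at the end shows the degenerate behaviour on flat-colour images that motivates the paper: two different purely chromatic images collapse to the same hash.

```python
import numpy as np
from PIL import Image
from scipy.fft import dctn

def phash(img, hash_size=8, img_size=32):
    """Classical perception hash: binarize the low-frequency DCT block against its median."""
    gray = np.asarray(img.convert("L").resize((img_size, img_size)), dtype=np.float64)
    coeffs = dctn(gray, norm="ortho")                 # 2-D discrete cosine transform
    low = coeffs[:hash_size, :hash_size]              # keep only the low-frequency corner
    bits = (low > np.median(low)).flatten()           # 64 bits relative to the median coefficient
    return int("".join("1" if b else "0" for b in bits), 2)

def hamming(h1, h2):
    """Number of differing bits; a small distance means perceptually similar images."""
    return bin(h1 ^ h2).count("1")

# two different flat-colour images receive an identical hash (distance 0),
# illustrating why plain pHash fails on purely chromatic backgrounds
a = phash(Image.new("RGB", (256, 256), (255, 255, 255)))
b = phash(Image.new("RGB", (256, 256), (200, 200, 200)))
print(hamming(a, b))
```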
Citations: 0
S4R: Rethinking Point Cloud Sampling via Guiding Upsampling-Aware Perception
IF 9.7 · CAS Tier 1 (Computer Science) · Q1 (Computer Science, Information Systems) · Pub Date: 2025-07-10 · DOI: 10.1109/TMM.2025.3586148
Zhuangzi Li;Shan Liu;Wei Gao;Guanbin Li;Ge Li
Point cloud sampling aims to derive a sparse point cloud from a relatively dense one, which is essential for efficient data transmission and storage. While existing deep sampling methods prioritize preserving the perception of sampled point clouds for downstream networks, few studies have critically examined the rationale behind this goal. Specifically, we observe that sampling can cause a perceptual degradation phenomenon in many influential downstream networks, impairing their ability to process sampled point clouds effectively. We theoretically reveal the nature of this phenomenon and construct a novel sampling target by uniting upsampling and perceptual reconstruction. Accordingly, we propose a Maximum A Posteriori (MAP) sampling framework named Sample for Reconstruct (S4R), which drives the sampling stage to infer upsampling-guided perception. In S4R, we design simple but effective sampling and upsampling networks using residual-based graph convolutions and incorporate a pseudo-residual connection to introduce prior knowledge. This architecture takes advantage of reconstruction properties and allows the sampling network to be trained in an unsupervised manner. Extensive experiments on classical networks demonstrate the excellent performance of S4R compared with previous sampling schemes and reveal its advantages on different point cloud downstream tasks, i.e., classification, reconstruction, and segmentation.
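The "residual-based graph convolution" building block mentioned above can be illustrated generically: gather k-nearest-neighbour edge features, apply a shared MLP, max-pool over neighbours, and add a residual connection. The sketch below is that generic EdgeConv-style block under assumptions made here (neighbourhood size, MLP widths, per-cloud processing); it is not the S4R architecture.

```python
import torch
import torch.nn as nn

class ResidualGraphConv(nn.Module):
    """Sketch of an EdgeConv-style residual graph convolution on point features."""
    def __init__(self, dim: int, k: int = 16):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (N, dim) per-point features of one point cloud
        dists = torch.cdist(feats, feats)                      # (N, N) pairwise distances
        idx = dists.topk(self.k, largest=False).indices        # (N, k) nearest neighbours (incl. self)
        neighbours = feats[idx]                                # (N, k, dim)
        center = feats.unsqueeze(1).expand(-1, self.k, -1)     # (N, k, dim)
        edges = torch.cat([center, neighbours - center], -1)   # (N, k, 2*dim) edge features
        out = self.mlp(edges).max(dim=1).values                # aggregate over neighbours
        return out + feats                                     # residual connection

out = ResidualGraphConv(dim=64)(torch.randn(1024, 64))
print(out.shape)  # torch.Size([1024, 64])
```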
Citations: 0
Synthesizing Multi-Person and Rare Pose Images for Human Pose Estimation
IF 9.7 · CAS Tier 1 (Computer Science) · Q1 (Computer Science, Information Systems) · Pub Date: 2025-07-04 · DOI: 10.1109/TMM.2025.3586122
Liuqing Zhao;Zichen Tian;Peng Zou;Richang Hong;Qianru Sun
Human pose estimation (HPE) models underperform in recognizing rare poses because their training datasets suffer from data imbalance (i.e., there are few image samples for rare poses). From a data perspective, the most intuitive solution is to synthesize data for rare poses. Rule-based methods apply manual manipulations (such as Cutout and GridMask) to the existing data, so the limited diversity of the resulting data constrains the model. An alternative is to learn the underlying data distribution via deep generative models (such as ControlNet and HumanSD) and then sample “new data” from the distribution. This works well for generating frequent poses in common scenes, but suffers when applied to rare poses or complex scenes (such as multiple persons with overlapping limbs). In this paper, we address both issues, i.e., rare poses and complex scenes, for person image generation with a two-stage method. In the first stage, we design a controllable pose generator named PoseFactory to synthesize rare poses. This generator is trained on augmented pose data, where each pose is labelled with its level of difficulty and rarity. In the second stage, we introduce a multi-person image generator named MultipGenerator, which is conditioned on multiple human poses and textual descriptions of complex scenes. Both stages are controllable in terms of pose diversity and scene complexity. For evaluation, we conduct extensive experiments on three widely used datasets: MS-COCO, HumanArt, and OCHuman. Compared against traditional pose data augmentation and person image generation methods, our method demonstrates superior performance both quantitatively and qualitatively.
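Of the rule-based manipulations mentioned above, Cutout is the simplest: it zeroes one random square patch of the training image. The NumPy sketch below makes that baseline concrete; the patch size and placement rule are generic defaults chosen here, not settings from the paper, and it says nothing about the proposed generative pipeline.

```python
import numpy as np

def cutout(image, size=32, rng=None):
    """Zero out one random size-by-size square patch of an H x W x C image (Cutout)."""
    rng = rng if rng is not None else np.random.default_rng()
    h, w = image.shape[:2]
    cy, cx = int(rng.integers(0, h)), int(rng.integers(0, w))   # random patch centre
    y0, y1 = max(0, cy - size // 2), min(h, cy + size // 2)     # clip the patch to the image
    x0, x1 = max(0, cx - size // 2), min(w, cx + size // 2)
    out = image.copy()
    out[y0:y1, x0:x1] = 0
    return out

augmented = cutout(np.full((256, 192, 3), 255, dtype=np.uint8))
print(augmented.shape)
```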
Citations: 0
Hierarchical Multi-Prototype Discrimination: Boosting Support-Query Matching for Few-Shot Segmentation
IF 9.7 · CAS Tier 1 (Computer Science) · Q1 (Computer Science, Information Systems) · Pub Date: 2025-07-04 · DOI: 10.1109/TMM.2025.3586125
Wenbo Xu;Huaxi Huang;Yongshun Gong;Litao Yu;Qiang Wu;Jian Zhang
Few-shot segmentation (FSS) aims to train a model on base classes with sufficient annotations and then task it with predicting a binary mask that identifies novel-class pixels from only a few labeled images. Mainstream FSS methods adopt a support-query matching paradigm that activates target regions of the query image according to their similarity to a single support class prototype. However, this prototype vector tends to overfit the support images, leading to under-matching in latent query object regions and spurious matches with base-class features in the query image. To address these issues, this study reformulates conventional single-foreground-prototype matching as a multi-prototype matching paradigm, in which query features exhibiting high confidence with non-target prototypes are categorized as background. Specifically, target query features are drawn closer to the novel class prototype through a Masked Cross-Image Encoding (MCE) module, and a Semantic Multi-prototype Matching (SMM) module is incorporated to collaboratively filter unexpected base-class regions on multi-scale features. Furthermore, we devise an adaptive class activation map, termed the target-aware class activation map (TCAM), to preserve semantically coherent regions that might otherwise be suppressed under pixel-wise matching guidance. Experimental results on the PASCAL-5$^{i}$ and COCO-20$^{i}$ datasets demonstrate the advantage of the proposed modules, with the holistic approach outperforming state-of-the-art methods.
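The core decision rule of the multi-prototype paradigm can be written as a small cosine-similarity routine: compare each query pixel feature against the foreground prototype and several non-target prototypes, and label as background any pixel whose best match is a non-target prototype. The sketch below is an illustrative reading of that rule under assumptions made here (feature shapes, prototype ordering); it is not the SMM module itself.

```python
import torch
import torch.nn.functional as F

def multi_prototype_match(query_feats: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    """Assign each query feature to its most similar prototype (index 0 = target class).

    query_feats: (H*W, D) pixel features of the query image
    prototypes:  (P, D) row 0 is the foreground prototype, rows 1.. are non-target prototypes
    Returns a binary foreground mask of shape (H*W,).
    """
    q = F.normalize(query_feats, dim=-1)    # unit-norm pixel features
    p = F.normalize(prototypes, dim=-1)     # unit-norm prototypes
    sims = q @ p.t()                        # (H*W, P) cosine similarities
    best = sims.argmax(dim=-1)              # most similar prototype per pixel
    return (best == 0).float()              # pixels matching a non-target prototype -> background

mask = multi_prototype_match(torch.randn(64 * 64, 256), torch.randn(5, 256))
print(mask.shape, mask.mean())
```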
Citations: 0
MossVLN: Memory-Observation Synergistic System for Continuous Vision-Language Navigation
IF 9.7 · CAS Tier 1 (Computer Science) · Q1 (Computer Science, Information Systems) · Pub Date: 2025-07-04 · DOI: 10.1109/TMM.2025.3586105
Ting Yu;Yifei Wu;Qiongjie Cui;Qingming Huang;Jun Yu
Navigating continuous environments with vision-language cues presents critical challenges, particularly in the accuracy of waypoint prediction and the quality of navigation decision-making. Traditional methods, which predominantly rely on spatial data from depth images or straightforward RGB-depth integration, frequently encounter difficulties in environments where waypoints share similar spatial characteristics, leading to erroneous navigation outcomes. Additionally, effective navigation decisions are often hindered by the inadequacies of traditional topological maps and by uneven data sampling. In response, this paper introduces a robust memory-observation synergistic vision-language navigation framework that substantially enhances the navigation capabilities of agents operating in continuous environments. We present an observation-driven waypoint predictor that effectively utilizes spatial data and integrates aligned visual and textual cues to significantly improve the accuracy of waypoint predictions in complex real-world scenarios. Additionally, we develop a memory-observation planning approach that leverages panoramic environmental memory and detailed current observations, enabling more informed and precise navigation decisions. Our framework sets new performance benchmarks on the VLN-CE dataset, achieving a 60.25% success rate (SR) and 50.89% success weighted by path length (SPL) on the R2R-CE unseen validation split. Furthermore, when adapted to a discrete environment, our model also shows exceptional performance on the R2R dataset, achieving a 74% SR and a 64% SPL on the unseen validation split.
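For reference, the two reported metrics follow the standard definitions used for embodied navigation: SR is the fraction of episodes that stop within the success radius of the goal, and SPL additionally penalizes detours. With N episodes, S_i the per-episode success indicator, l_i the shortest-path length to the goal, and p_i the length of the executed path:

```latex
\mathrm{SR} = \frac{1}{N}\sum_{i=1}^{N} S_{i},
\qquad
\mathrm{SPL} = \frac{1}{N}\sum_{i=1}^{N} S_{i}\,\frac{\ell_{i}}{\max(p_{i},\, \ell_{i})}.
```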
Citations: 0
Pathology-Preserving Transformer Based on Multicolor Space for Low-Quality Medical Image Enhancement
IF 9.7 · CAS Tier 1 (Computer Science) · Q1 (Computer Science, Information Systems) · Pub Date: 2025-07-04 · DOI: 10.1109/TMM.2025.3586133
Qingshan Hou;Yaqi Wang;Peng Cao;Jianguo Ju;Huijuan Tu;Xiaoli Liu;Jinzhu Yang;Huazhu Fu;Yih Chung Tham;Osmar R. Zaiane
Medical images acquired under suboptimal conditions often suffer from quality degradation such as low light, blur, and artifacts. Such degradation obscures lesions and anatomical structures, making it difficult to distinguish key pathological regions and significantly increasing the risk of misdiagnosis by automated diagnostic systems or clinicians. To address this challenge, we propose a multi-color-space-based quality enhancement network (MSQNet) that effectively removes global low-quality factors while preserving pathology-related characteristics for improved clinical observation and analysis. We first revisit the properties of image quality enhancement in different color spaces: the V channel in the HSV space better represents the contrast and brightness enhancement process, whereas the A/B channels in the LAB space are more sensitive to the color changes of low-quality images. The proposed framework harnesses these complementary properties to optimize the enhancement process. Specifically, we propose a pathology-preserving transformer designed to selectively aggregate features across different color spaces and enable comprehensive multiscale feature fusion. Leveraging these capabilities, MSQNet effectively enhances low-quality RGB medical images while preserving key pathological features, thereby establishing a new paradigm in medical image enhancement. Extensive experiments on three public medical image datasets demonstrate that MSQNet outperforms traditional enhancement techniques and state-of-the-art methods in terms of both quantitative metrics and qualitative visual assessment. MSQNet improves image quality while preserving pathological features and anatomical structures, facilitating accurate diagnosis and analysis by medical professionals and automated systems.
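The color-space observation above is easy to reproduce: the V channel of HSV isolates brightness, while the A and B channels of LAB carry the chromatic component. The OpenCV snippet below only extracts those channels to illustrate the kind of inputs a multi-color-space design would operate on; it is not part of the MSQNet pipeline, and the function name is chosen here for illustration.

```python
import cv2
import numpy as np

def split_color_spaces(rgb: np.ndarray):
    """Return the HSV V-channel (brightness) and LAB A/B channels (chroma) of an RGB image."""
    hsv = cv2.cvtColor(rgb, cv2.COLOR_RGB2HSV)
    lab = cv2.cvtColor(rgb, cv2.COLOR_RGB2LAB)
    v_channel = hsv[..., 2]                            # brightness information
    a_channel, b_channel = lab[..., 1], lab[..., 2]    # green-red and blue-yellow chroma
    # note: for 8-bit input, OpenCV stores the A/B channels shifted so that 128 is neutral
    return v_channel, a_channel, b_channel

v, a, b = split_color_spaces(np.zeros((64, 64, 3), dtype=np.uint8))
print(v.shape, a.shape, b.shape)
```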
Citations: 0