
Journal of Visual Communication and Image Representation: Latest Publications

GAN semantics for personalized facial beauty synthesis and enhancement
IF 3.1 Tier 4 (Computer Science) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-11-07 DOI: 10.1016/j.jvcir.2025.104640
Irina Lebedeva, Fangli Ying, Yi Guo, Taihao Li
Generative adversarial networks (GANs), whose popularity and scope of applications continue to grow, have already demonstrated impressive results in human face image processing. Face aging, completion, attribute transfer, and synthesis are only some examples of the successful application of GANs. Although beauty enhancement and face generation conditioned on attractiveness level are also among the applications of GANs, they have so far been investigated only from a universal or generic point of view, and no studies have addressed the personalized aspect of these tasks. In this work, this gap is filled and a generative framework that synthesizes a realistic human face based on an individual’s beauty preferences is introduced. To this end, StyleGAN’s properties and the capacity for semantic face manipulation in its latent space are studied and utilized. Beyond face generation, the proposed framework is able to enhance the beauty level of a real face according to personal beauty preferences. Extensive experiments are conducted on two publicly available facial beauty datasets with different properties in terms of images and raters, SCUT-FBP5500 and the multi-ethnic MEBeauty. The quantitative evaluations demonstrate the effectiveness of the proposed framework and its advantages over the state of the art, while the qualitative evaluations also reveal and illustrate interesting social and cultural patterns in personal beauty preferences.
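The abstract describes editing faces by moving StyleGAN latent codes along semantic directions tied to one rater's preferences. The sketch below illustrates only that general idea and is not the authors' method: it assumes W-space codes are already available for rated images, fits a simple least-squares direction against one rater's scores, and treats `G` as a hypothetical pretrained StyleGAN generator.

```python
import numpy as np

def fit_preference_direction(latents, scores):
    """Least-squares direction in W-space that correlates with one rater's
    beauty scores; latents: (N, 512) codes, scores: (N,) ratings."""
    d, *_ = np.linalg.lstsq(latents, scores, rcond=None)
    return d / np.linalg.norm(d)

def enhance(w_code, direction, strength=2.0):
    """Move a latent code along the personalized direction; a larger strength
    gives a stronger edit at the cost of identity preservation."""
    return w_code + strength * direction

# Usage sketch (G is a hypothetical pretrained StyleGAN generator):
# d = fit_preference_direction(train_latents, one_raters_scores)
# edited_image = G.synthesis(enhance(w, d))
```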
Citations: 0
Copy-move forgery detection of social media images using tendency sparsity filtering and variable cluster spectral clustering
IF 3.1 Tier 4 (Computer Science) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-11-07 DOI: 10.1016/j.jvcir.2025.104635
Cong Lin, Hai Yang, Ke Huang, Daqiang Long, Yuke Zhong, Yuqiao Deng, Yamin Wen
Copy-move forgery is a common form of image tampering. In practice, most of the images encountered have been compressed by social media platforms. Motivated by this, a copy-move forgery detection method for social media images based on tendency sparsity (TS) filtering and variable cluster spectral clustering (VCS clustering) is proposed. First, we normalize the image scale to obtain a sufficient number of keypoints. To accelerate matching, a hierarchical matching method is adopted. Next, TS filtering is applied to remove the preference set (PS) vectors that do not meet the condition. To estimate a good affine transformation, the PS vectors are clustered using VCS clustering. Finally, the tampering localization result is output. Comparative experiments on several public uncompressed datasets, as well as datasets compressed by social media, show that the proposed method is robust in detecting social media images and outperforms state-of-the-art methods.
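The generic keypoint pipeline the abstract outlines (keypoints, matching, filtering, clustering, affine estimation) can be sketched as follows. This uses off-the-shelf SIFT, a ratio test, and scikit-learn spectral clustering as crude stand-ins for the paper's TS filtering and VCS clustering; every detail here is an illustrative assumption, not the authors' implementation.

```python
import cv2
import numpy as np
from sklearn.cluster import SpectralClustering

def detect_copy_move(gray, ratio=0.6, n_clusters=2):
    sift = cv2.SIFT_create()
    kps, desc = sift.detectAndCompute(gray, None)
    # Match the image against itself; m[0] is the trivial self-match.
    matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(desc, desc, k=3)
    pairs = []
    for m in matches:
        if len(m) >= 3 and m[1].distance < ratio * m[2].distance:
            pairs.append((kps[m[1].queryIdx].pt, kps[m[1].trainIdx].pt))
    src = np.float32([p for p, _ in pairs])
    dst = np.float32([q for _, q in pairs])
    # Cluster displacement vectors so each cluster supports one affine model
    # (a rough stand-in for TS filtering + VCS clustering).
    disp = dst - src
    sim = np.exp(-np.linalg.norm(disp[:, None] - disp[None, :], axis=-1))
    labels = SpectralClustering(n_clusters=n_clusters,
                                affinity="precomputed").fit_predict(sim)
    models = []
    for c in range(n_clusters):
        idx = labels == c
        if idx.sum() >= 3:
            A, _ = cv2.estimateAffinePartial2D(src[idx], dst[idx],
                                               method=cv2.RANSAC)
            models.append(A)
    return pairs, labels, models
```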
Citations: 0
Multi-layer graph constraint dictionary pair learning for image classification
IF 3.1 Tier 4 (Computer Science) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-11-06 DOI: 10.1016/j.jvcir.2025.104638
Yulin Sun, Guangming Shi, Weisheng Dong, Xuemei Xie
Multi-layer dictionary learning (MDL) has demonstrated significantly improved performance for image classification. However, most existing MDL methods adopt an overall shared dictionary learning architecture, which weakens the discrimination ability of the dictionaries. To address this, we propose a framework called Multi-layer Graph Constraint Dictionary Pair Learning (MGDPL). Our MGDPL integrates multi-layer dictionary pair learning, a structure graph constraint, and discriminative sparse representations into a unified framework. First, the multi-layer structured dictionary learning mechanism is applied to dictionary pairs, enhancing discrimination by letting each layer rebuild the reconstruction error of the previous layer. Second, it imposes the structure graph constraint on the sub-sparse representations to ensure the discrimination capability of the near-neighbor graph. Third, the multi-layer discriminant graph regularized constraint term ensures high intra-class tightness and inter-class dispersion of dictionary atoms in the reconstruction space. Extensive experiments show that MGDPL achieves excellent performance compared with other state-of-the-art methods.
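To make the "dictionary pair" idea concrete, the following minimal numpy sketch alternates updates of a synthesis dictionary D and an analysis dictionary P so that X is approximately D P X. It omits the paper's multi-layer structure, graph constraints, and discrimination terms; the ridge term and update order are assumptions for illustration only.

```python
import numpy as np

def dictionary_pair_step(X, D, P, lam=1e-3):
    """One alternating update for the pair (P, D) reducing ||X - D P X||_F^2,
    with a small ridge term for numerical stability."""
    A = P @ X                                            # analysis codes (k, n)
    # Update synthesis dictionary D with codes fixed:
    # D = X A^T (A A^T + lam I)^-1
    D = X @ A.T @ np.linalg.inv(A @ A.T + lam * np.eye(A.shape[0]))
    # Update analysis dictionary P with D fixed: first find the codes that
    # best reconstruct X through D, then fit P to produce those codes.
    A_target, *_ = np.linalg.lstsq(D, X, rcond=None)     # (k, n)
    Pt, *_ = np.linalg.lstsq(X.T, A_target.T, rcond=None)  # (d, k)
    return D, Pt.T

# Usage sketch: X is (features, samples), k is the number of atoms.
# D = np.random.randn(X.shape[0], k); P = np.random.randn(k, X.shape[0])
# for _ in range(20):
#     D, P = dictionary_pair_step(X, D, P)
```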
Citations: 0
Exploring a Non-Parametric Uncertain Adaptive training method for facial expression recognition
IF 3.1 Tier 4 (Computer Science) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-11-06 DOI: 10.1016/j.jvcir.2025.104636
Renhao Sun, Chaoqun Wang, Yujian Wang
In facial expression recognition, the uncertainties introduced by ambiguous facial expressions and the subjectiveness of annotators lead to inter-class similarity and intra-class diversity among annotated samples, which in turn degrades recognition results. To mitigate the performance loss caused by these uncertainties, we explore a Non-Parametric Uncertain Adaptive (NoPUA) method that suppresses ambiguous samples during training for facial expression recognition. Specifically, we first propose a self-paced feature bank module over mini-batches to compute the top-K similarity rank for each training sample, and then design a sample-to-class weighting score module based on this similarity rank to grade the different categories with respect to the similarity classes of the samples themselves. Finally, we modify the labels of each uncertain sample using the self-adaptive relabeling module for multi-category scoring described above. Our method is non-parametric, easy to implement, and model-agnostic. Extensive experiments on three public benchmarks (RAF-DB, FERPlus, AffectNet) validate the effectiveness of NoPUA when embedded in a variety of algorithms (baseline, SCN, RUL, EAC, DAN, POSTER++), achieving better performance.
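The two building blocks the abstract names (a mini-batch feature bank with a top-K similarity rank, and a sample-to-class score used to relabel uncertain samples) can be illustrated with the PyTorch sketch below. The similarity-weighted vote, the softmax, and the relabeling threshold are assumptions; the paper's exact weighting and relabeling rules are not reproduced.

```python
import torch
import torch.nn.functional as F

def sample_to_class_scores(features, labels, num_classes, k=5):
    """For each sample, rank its top-k most similar samples in the batch
    (the 'feature bank') and turn their labels into per-class scores."""
    f = F.normalize(features, dim=1)               # (B, D)
    sim = f @ f.t()                                # cosine similarity (B, B)
    sim.fill_diagonal_(-1.0)                       # ignore self-similarity
    topk_sim, topk_idx = sim.topk(k, dim=1)        # (B, k)
    neighbour_labels = labels[topk_idx]            # (B, k)
    onehot = F.one_hot(neighbour_labels, num_classes).float()
    # Similarity-weighted vote over the k neighbours' classes.
    scores = (topk_sim.unsqueeze(-1) * onehot).sum(dim=1)
    return F.softmax(scores, dim=1)                # (B, C)

def relabel_uncertain(labels, scores, threshold=0.7):
    """Replace a label when the neighbourhood strongly favours another class
    (an illustrative stand-in for the self-adaptive relabeling module)."""
    conf, pred = scores.max(dim=1)
    return torch.where((conf > threshold) & (pred != labels), pred, labels)
```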
Citations: 0
Fast adaptive QTMT partitioning for intra 360° video coding based on gradient boosted trees
IF 3.1 Tier 4 (Computer Science) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-11-06 DOI: 10.1016/j.jvcir.2025.104629
Jose N. Filipe, Luis M.N. Tavora, Sergio M.M. Faria, Antonio Navarro, Pedro A.A. Assuncao
The rising demand for UHD and 360° content has driven the creation of advanced compression tools with enhanced coding efficiency. Versatile Video Coding (VVC) has recently improved coding efficiency over previous standards, but introduces significantly higher computational complexity. To address this, this paper presents a novel intra-coding method for 360° video in Equirectangular Projection (ERP) format that reduces complexity with minimal impact on coding efficiency. It shows that the North, Equator, and South regions of ERP images exhibit distinct complexity and spatial characteristics. A region-based approach uses multiple Gradient Boosted Trees models for each region to determine if a partition type can be skipped. Additionally, an adaptive decision threshold scheme is introduced to optimise vertical partitioning in polar regions. The paper also presents an optimisation solution for the Complexity/BD-Rate loss trade-off parameters. Experimental results demonstrate a 50% complexity gain with only a 0.37% BD-Rate loss, outperforming current state-of-the-art methods.
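The decision mechanism the abstract sketches, per-region gradient boosted trees predicting whether a partition type can be skipped, can be illustrated roughly as below, using scikit-learn's GradientBoostingClassifier as a stand-in for the paper's models. The latitude-based region split, the block features, the binary labels, and the confidence threshold are all assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

REGIONS = ("north", "equator", "south")

def region_of(block_row, height):
    """Assign an ERP block to a latitude band (illustrative 25/50/25 split)."""
    y = block_row / height
    return "north" if y < 0.25 else ("south" if y > 0.75 else "equator")

def train_skip_models(features_by_region, skip_labels_by_region):
    """One GBT per region, predicting whether a partition type can be skipped
    (labels assumed to be 0 = must evaluate, 1 = safe to skip)."""
    return {r: GradientBoostingClassifier(n_estimators=100, max_depth=3)
               .fit(features_by_region[r], skip_labels_by_region[r])
            for r in REGIONS}

def can_skip(models, region, block_features, threshold=0.9):
    """Skip the partition check only when the model is very confident,
    trading a small BD-rate loss for encoder speed."""
    p_skip = models[region].predict_proba(block_features.reshape(1, -1))[0, 1]
    return p_skip > threshold
```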
Citations: 0
Variable-rate learned image compression with integer-arithmetic-only inference
IF 3.1 Tier 4 (Computer Science) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-11-04 DOI: 10.1016/j.jvcir.2025.104634
Fan Ye, Li Li, Dong Liu
Learned image compression (LIC) achieves superior rate–distortion performance over traditional codecs but faces deployment challenges due to floating-point inconsistencies and high computational cost. Existing quantized LIC models are typically single-rate and lack support for variable-rate compression, limiting their adaptability. We propose a fully quantized variable-rate LIC framework that enables integer-only inference across all components. Our method introduces bitrate-specific quantization parameters to address rate-dependent activation variations. All computations — including weights, biases, activations, and nonlinearities — are performed using 8-bit integer operations such as multiplications, bit-shifts, and lookup tables. To further enhance hardware efficiency, we adopt per-layer quantization and reduce intermediate precision from 32-bit to 16-bit. Experiments show that our fully 8-bit quantized model reduces bitrate by 19.2% compared to VTM-17.2 intra coding on standard test sets. It also achieves 50.5% and 52.2% speedup in encoding and decoding, respectively, over its floating-point counterpart.
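For context, the core integer-only trick such schemes rely on can be shown with a small fixed-point requantization sketch: a floating-point scale is folded into an integer multiplier and a bit-shift, so an int32 accumulator (e.g., the output of an int8 convolution) can be mapped back to int8 without any floating-point arithmetic. The bit widths echo the 8/16/32-bit scheme mentioned in the abstract, but the specific values, rounding choice, and function names are assumptions, not the paper's implementation.

```python
import numpy as np

def quantize_multiplier(scale, shift_bits=15):
    """Approximate a float scale as (multiplier, shift) so that
    x * scale ~= (x * multiplier) >> shift using integers only."""
    multiplier = int(round(scale * (1 << shift_bits)))
    return multiplier, shift_bits

def requantize_int8(acc_int32, multiplier, shift):
    """Requantize an int32 accumulator to int8 with round-to-nearest."""
    rounding = 1 << (shift - 1)
    out = (acc_int32.astype(np.int64) * multiplier + rounding) >> shift
    return np.clip(out, -128, 127).astype(np.int8)

# Usage sketch: an int8 weight/activation product accumulated in int32,
# then mapped to the next layer's int8 range without touching floats.
acc = np.array([1234, -5678, 40000], dtype=np.int32)
mult, sh = quantize_multiplier(0.0037)        # assumed combined scale
print(requantize_int8(acc, mult, sh))
```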
Citations: 0
SSUFormer: Spatial–spectral UnetFormer for improving hyperspectral image classification
IF 3.1 Tier 4 (Computer Science) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-11-04 DOI: 10.1016/j.jvcir.2025.104633
Thuan Minh Nguyen, Khoi Anh Bui, Myungsik Yoo
For hyperspectral image (HSI) classification, convolutional neural networks with a local kernel neglect the global HSI properties, and transformer networks often predict only the central pixel. This study proposes a spatial–spectral UnetFormer network to extract the full local and global spatial similarities and the long short-range spectral dependencies for HSI classification. This approach fuses a spectral transformer subnetwork and a spatial attention U-net subnetwork to create outputs. In the spectral subnetwork, the transformer is tailored at the embedding and head layers to generate a prediction for all input pixels. In the spatial attention U-net subnetwork, a local–global spatial feature model is introduced based on the U-net structure with a singular value decomposition-aided spatial self-attention module to emphasize useful details, mitigate the impact of noise, and eventually learn the global spatial features. The proposed model obtains results competitive with state-of-the-art methods for HSI classification on various public datasets.
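One idea mentioned in the abstract, using an SVD to aid spatial self-attention, can be sketched as projecting the spatial tokens onto their top singular components before computing attention, so the attention map is built from a denoised low-rank view. The module shape, the rank, and the way the low-rank view is used are assumptions and not the paper's actual architecture.

```python
import torch

def svd_lowrank_attention(x, rank=8):
    """x: (N, D) spatial tokens. Keep the top-`rank` singular components as a
    denoised basis, then run plain dot-product self-attention on it."""
    U, S, Vh = torch.linalg.svd(x, full_matrices=False)
    x_lr = U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank]     # low-rank view
    attn = torch.softmax(x_lr @ x_lr.t() / x.shape[1] ** 0.5, dim=-1)
    return attn @ x                      # re-weight the original tokens

# Usage sketch: 196 tokens of a 14x14 patch grid with 64 channels.
tokens = torch.randn(196, 64)
out = svd_lowrank_attention(tokens)
print(out.shape)                         # torch.Size([196, 64])
```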
Citations: 0
Retrieval augmented generation for smart calorie estimation in complex food scenarios
IF 3.1 Tier 4 (Computer Science) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-11-03 DOI: 10.1016/j.jvcir.2025.104632
Mayank Sah, Saurya Suman, Jimson Mathew
Accurate food recognition and calorie estimation are critical for managing diet-related health issues such as obesity and diabetes. Traditional food logging methods rely on manual input, leading to inaccurate nutritional records. Although recent advances in computer vision and deep learning offer automated solutions, existing models struggle with generalizability due to homogeneous datasets and limited representation of complex cuisines like Indian food. This paper introduces a dataset containing over 15,000 images of 56 popular Indian food items. Curated from diverse sources, including social media and real-world photography, the dataset aims to capture the complexity of Indian meals, where multiple food items often appear together in a single image. This ensures greater variability in lighting, presentation, and image quality compared to existing datasets. We evaluated the dataset with various YOLO-based models, from YOLOv5 through YOLOv12, and enhanced the backbone with omni-scale feature learning from OSNet, improving detection accuracy. In addition, we integrate a Retrieval-Augmented Generation (RAG) module with YOLO, which refines food identification by associating fine-grained food categories with nutritional information, ingredients, and recipes. Our approach demonstrates improved performance in recognizing complex meals. It addresses key challenges in food recognition, offering a scalable solution for accurate calorie estimation, especially for culturally diverse cuisines like Indian food.
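The detect-retrieve-aggregate flow the abstract describes can be sketched as: a detector returns food labels, a nutrition index is queried for each label, and calories are summed. In the sketch below, `run_detector` is a placeholder for any YOLO-style model, and the index entries (labels, calorie values, ingredients) are invented placeholders for illustration only, not measured nutritional data.

```python
# Minimal detect-retrieve-aggregate sketch with placeholder data.
NUTRITION_INDEX = {
    "samosa":     {"kcal_per_serving": 260, "ingredients": ["potato", "pastry"]},
    "dal tadka":  {"kcal_per_serving": 180, "ingredients": ["lentils", "ghee"]},
    "plain rice": {"kcal_per_serving": 200, "ingredients": ["rice"]},
}

def retrieve(label):
    """Look up nutritional facts for a detected food label (the 'retrieval'
    step; a production system would query a proper knowledge base)."""
    return NUTRITION_INDEX.get(label.lower())

def estimate_calories(detections, min_conf=0.5):
    """Aggregate per-item calories over confident detections."""
    total, items = 0, []
    for label, conf in detections:
        record = retrieve(label)
        if record is None or conf < min_conf:
            continue
        total += record["kcal_per_serving"]
        items.append((label, record["kcal_per_serving"]))
    return total, items

# detections = run_detector(image)        # hypothetical YOLO inference
detections = [("samosa", 0.91), ("dal tadka", 0.84), ("plain rice", 0.77)]
print(estimate_calories(detections))
```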
Citations: 0
CrossGlue: Cross-Modal Image matching via potential message investigation and visual-gradient message integration
IF 3.1 Tier 4 (Computer Science) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-11-03 DOI: 10.1016/j.jvcir.2025.104620
Chaobo Yu, Zhonghui Pei, Xiaoran Wang, Huabing Zhou
Compared with single-modal image matching, cross-modal image matching can provide more comprehensive and detailed information, which is essential for a series of vision-related tasks. However, the matching process is difficult due to differences in imaging principles, proportions, and relative translation and rotation between visible and infrared images. Moreover, existing detection-based single-modal matching methods have low accuracy, while detection-free methods are time-consuming and fail to handle real-world scenarios. Therefore, this paper proposes CrossGlue, a lightweight cross-modal image matching framework. The framework introduces a cross-modal message transfer (CMT) module to integrate more potential information for each keypoint through one-to-one image transfer, and a visual-gradient graph neural network (VG-GNN) to enhance visible–infrared matching in degraded scenarios. Experimental results on public datasets show that CrossGlue performs excellently among detection-based methods and outperforms strong baseline methods in tasks such as homography estimation and relative pose estimation.
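As background for why gradient information helps across modalities, the sketch below builds a simple gradient-orientation descriptor (more stable across visible/infrared than raw intensity) and keeps only mutual nearest-neighbour matches. This is a generic baseline for intuition, not CrossGlue's CMT module or VG-GNN.

```python
import numpy as np

def gradient_descriptor(patch):
    """Describe a patch by its gradient-orientation histogram, which is more
    stable across visible/infrared imagery than raw intensities."""
    gy, gx = np.gradient(patch.astype(np.float32))
    mag, ang = np.hypot(gx, gy), np.arctan2(gy, gx)
    hist, _ = np.histogram(ang, bins=16, range=(-np.pi, np.pi), weights=mag)
    return hist / (np.linalg.norm(hist) + 1e-8)

def mutual_nn_matches(desc_vis, desc_ir):
    """Keep only keypoint pairs that are each other's nearest neighbour.
    desc_vis: (N, 16), desc_ir: (M, 16) stacked descriptors."""
    sim = desc_vis @ desc_ir.T
    nn_vi = sim.argmax(axis=1)     # best IR match for each visible point
    nn_iv = sim.argmax(axis=0)     # best visible match for each IR point
    return [(i, j) for i, j in enumerate(nn_vi) if nn_iv[j] == i]
```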
Citations: 0
Multiple cross-modal complementation network for lightweight RGB-D salient object detection
IF 3.1 Tier 4 (Computer Science) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-11-01 DOI: 10.1016/j.jvcir.2025.104622
Changhe Zhang, Fen Chen, Lian Huang, Zongju Peng, Xin Hu
The large model sizes and high computational costs of traditional convolutional neural networks hinder the deployment of RGB-D salient object detection (SOD) models on mobile devices. To effectively balance and improve the efficiency and accuracy of RGB-D SOD, we propose a multiple cross-modal complementation network (MCCNet) that fully utilizes complementary information in multiple dimensions. First, exploiting the information complementarity between depth features and RGB features, we propose a multiple cross-modal complementation (MCC) module to strengthen the feature representation and fusion ability of lightweight networks. Secondly, based on the MCC module, we propose a global and local features cooperative depth enhancement module to improve the quality of depth maps. Finally, we propose an RGB-assisted extraction and fusion backbone: RGB features are fed into this backbone to assist the extraction of depth features, so that they can be efficiently fused with the extracted depth features. Experimental results on five challenging datasets show that MCCNet achieves 1955 fps on a single RTX 4090 GPU with only 5.5M parameters, and performs favorably against 12 state-of-the-art RGB-D SOD methods in terms of accuracy.
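The general pattern of cross-modal complementation, each modality re-weighting the other before fusion, can be illustrated with the small PyTorch module below, where depth features gate the RGB stream via channel attention and vice versa. The layer sizes and the gating form are assumptions for illustration, not MCCNet's actual MCC module.

```python
import torch
import torch.nn as nn

class CrossModalGate(nn.Module):
    """Each modality produces a channel-attention gate for the other,
    then the re-weighted features are summed into a fused map."""
    def __init__(self, channels):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.gate_rgb = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.gate_dep = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, f_rgb, f_dep):
        f_rgb2 = f_rgb * self.gate_dep(self.pool(f_dep))   # depth gates RGB
        f_dep2 = f_dep * self.gate_rgb(self.pool(f_rgb))   # RGB gates depth
        return f_rgb2 + f_dep2

# Usage sketch on 56x56 feature maps with 64 channels.
fuse = CrossModalGate(64)
out = fuse(torch.randn(1, 64, 56, 56), torch.randn(1, 64, 56, 56))
print(out.shape)                        # torch.Size([1, 64, 56, 56])
```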
Citations: 0