Pub Date: 2025-11-09, DOI: 10.1016/j.jvcir.2025.104637
M. Anand, S. Babu
In the proposed facial expression recognition pipeline, a stacked Gaussian blur edge detection filter (S-Gbed) is first applied for filtering, and dynamic histogram equalization (DHE) is used to improve image contrast. Then, a triple attention-assisted FER vision transformer (T-FERViT) is used for feature extraction, and optimal features are selected using the Honey Badger chaotic optimization algorithm (HbcOa). Finally, facial expressions are classified by emotion using an African vulture-assisted depth convolutional stacked Long Short-Term Memory (LSTM) frame attention network (ADcFNet), in which the African Vulture Optimization Algorithm is used to optimize the network's loss function. The Karolinska Directed Emotional Faces (KDEF) dataset, the Facial Expression Recognition 2013 (FER-2013) dataset, and a facial emotion dataset are used to evaluate the ADcFNet model, and its overall performance is compared with existing models to demonstrate its superiority. The ADcFNet model attained 99.17%, 91.6%, and 95.9% accuracy on the KDEF, FER-2013, and facial emotion datasets, respectively.
{"title":"ADcFNet-deep learning based facial expression identification using FER vision transformer","authors":"M. Anand, S. Babu","doi":"10.1016/j.jvcir.2025.104637","DOIUrl":"10.1016/j.jvcir.2025.104637","url":null,"abstract":"<div><div>The stacked Gaussian blur edge detection filter (S-Gbed) is used for filtering. Dynamic Histogram equalization (DHE) is used to improve the contrast of an image. Then, a triple attention-assisted FER-vision transformer (T-FERViT) is used for feature extraction, and optimal features are selected using the Honey Badger chaotic optimization (HbcOa) algorithm. Finally, facial expressions are classified based on emotion using African vulture assisted Depth convolutional stacked Long Short-Term Memory (LSTM) Frame Attention network (ADcFNet). The African Vulture Optimization Algorithm is used to optimize the network’s loss function. Karolinska Directed Emotional Face dataset (KDEF), Face Expression Recognition-2013 (FER-2013) dataset, and facial emotion dataset are used to evaluate the ADcFNet model. Then, the overall performance of the proposed model is compared with other existing models to describe its superiority. The ADcFNet model attained 99.17%, 91.6%, and 95.9% accuracy in terms of KDEF, FER-2013, and facial emotion datasets, respectively.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"114 ","pages":"Article 104637"},"PeriodicalIF":3.1,"publicationDate":"2025-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145571648","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-07, DOI: 10.1016/j.jvcir.2025.104631
Sanxin Jiang, Hongliang Zhang, Changde Ding
To address the issues of inaccurate object localization and insufficient edge information extraction in Camouflaged Object Detection (COD), we propose DONet, a novel two-stage network inspired by how humans detect camouflaged objects: first identifying the general outline and then focusing on finer details. In the first stage, the network leverages an Edge Exploration Module (EEM) to locate object boundaries, refining this boundary information through Retrieve Attention. Subsequently, the Object Position Recognition Module (OPRM) detects the horizontal and vertical locations of camouflaged objects by integrating boundary information with high-level features; this information is further enhanced by combining multi-dilation channels and neighboring features. In the second stage, a Context Aggregation Module (CAM) aggregates contextual information to improve detection accuracy. Extensive experiments demonstrate that DONet surpasses 16 state-of-the-art methods across three challenging datasets, highlighting its effectiveness and superior performance. DONet also delivers strong detection performance in medical polyp segmentation.
{"title":"Dual-optimized two-stage Camouflaged Object Detection","authors":"Sanxin Jiang, Hongliang Zhang, Changde Ding","doi":"10.1016/j.jvcir.2025.104631","DOIUrl":"10.1016/j.jvcir.2025.104631","url":null,"abstract":"<div><div>To address the current issues of inaccurate object localization and insufficient edge information extraction in Camouflaged Object Detection (COD), inspired by how humans detect camouflaged objects—first identifying their general outline and then focusing on finer details—we propose a novel two-stage network, DONet, for COD. In the first stage, the network leverages an Edge Exploration Module (EEM) to locate object boundaries, refining this boundary information through Retrieve Attention. Subsequently, the Object Position Recognition Module (OPRM) detects the horizontal and vertical locations of camouflaged objects by integrating boundary information with high-level features. This information is further enhanced by combining multi-dilation channels and neighboring features. In the second stage, a Context Aggregation Module (CAM) is used to aggregate contextual information, improving detection accuracy. Extensive experiments demonstrate that DONet surpasses 16 state-of-the-art methods across three challenging datasets, highlighting its effectiveness and superior performance. In addition, DONet also has outstanding detection performance in the field of medical polyp segmentation.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"114 ","pages":"Article 104631"},"PeriodicalIF":3.1,"publicationDate":"2025-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145520900","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-07, DOI: 10.1016/j.jvcir.2025.104640
Irina Lebedeva , Fangli Ying , Yi Guo , Taihao Li
Generative adversarial networks (GANs), whose popularity and range of applications continue to grow, have already demonstrated impressive results in human face image processing; face aging, completion, attribute transfer, and synthesis are only some examples of their successful use. Although beauty enhancement and face generation conditioned on attractiveness level are also among the applications of GANs, these tasks have been investigated only from a universal or generic point of view, and no studies have addressed their personalized aspect. In this work, this gap is filled and a generative framework is introduced that synthesizes a realistic human face based on an individual's beauty preferences. To this end, StyleGAN's properties and the capacity for semantic face manipulation in its latent space are studied and exploited. Beyond face generation, the proposed framework is able to enhance the beauty level of a real face according to personal beauty preferences. Extensive experiments are conducted on two publicly available facial beauty datasets with different properties in terms of images and raters, SCUT-FBP5500 and the multi-ethnic MEBeauty. The quantitative evaluations demonstrate the effectiveness of the proposed framework and its advantages over the state of the art, while the qualitative evaluations reveal and illustrate interesting social and cultural patterns in personal beauty preferences.
{"title":"GAN semantics for personalized facial beauty synthesis and enhancement","authors":"Irina Lebedeva , Fangli Ying , Yi Guo , Taihao Li","doi":"10.1016/j.jvcir.2025.104640","DOIUrl":"10.1016/j.jvcir.2025.104640","url":null,"abstract":"<div><div>Generative adversarial networks (GANs) whose popularity and scope of applications continue to grow, have already demonstrated impressive results in human face image processing. Face aging, completion, attribute transfer, and synthesis are not the only examples of the successful implementation of GANs. Although, beauty enhancement and face generation with conditioning on attractiveness level are also among the applications of GANs, it has been investigated only from the universal or generic point of view, and there are no studies addressed to the personalized aspect of these issues. In this work, this gap is filled and a generative framework that synthesizes a realistic human face that is based on an individual’s beauty preferences is introduced. To this end, StyleGAN’s properties and the capacities of semantic face manipulation in its latent space are studied and utilized. Beyond the face generation, the proposed framework is able to enhance a beauty level on a real face according to personal beauty preferences. Extensive experiments are conducted on two publicly available facial beauty datasets with different properties in terms of images and raters, SCUT-FBP5500 and multi-ethnic MEBeauty. The quantitative evaluations demonstrate the effectiveness of the proposed framework and its advantages compared to the state-of-the-art, while the qualitative evaluations also reveal and illustrate interesting social and cultural patterns in personal beauty preferences.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"114 ","pages":"Article 104640"},"PeriodicalIF":3.1,"publicationDate":"2025-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145571726","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-07, DOI: 10.1016/j.jvcir.2025.104635
Cong Lin , Hai Yang , Ke Huang , Daqiang Long , Yuke Zhong , Yuqiao Deng , Yamin Wen
Copy-move forgery is a common form of image tampering, and in practice most images encountered have been compressed by social media platforms. Motivated by this, a copy-move forgery detection method for social media images based on tendency sparsity (TS) filtering and variable cluster spectral clustering (VCS clustering) is proposed. First, we normalize the image scale to obtain a sufficient number of keypoints, and a hierarchical matching method is adopted to accelerate matching. Next, TS filtering is applied to remove preference set (PS) vectors that do not satisfy the condition. To estimate a good affine transformation, the PS vectors are clustered using VCS clustering. Finally, the tampering localization result is output. Comparative experiments on several public uncompressed datasets, as well as datasets compressed by social media, show that the proposed method is robust when detecting social media images and outperforms state-of-the-art methods.
{"title":"Copy-move forgery detection of social media images using tendency sparsity filtering and variable cluster spectral clustering","authors":"Cong Lin , Hai Yang , Ke Huang , Daqiang Long , Yuke Zhong , Yuqiao Deng , Yamin Wen","doi":"10.1016/j.jvcir.2025.104635","DOIUrl":"10.1016/j.jvcir.2025.104635","url":null,"abstract":"<div><div>Copy-move forgery is a common way of image tampering. In reality, most of the images encountered are compressed by social media. Based on this, a copy-move forgery detection method of social media images based on tendency sparsity (TS) filtering and variable cluster spectral clustering (VCS clustering) is proposed. First, we normalize the image scale to obtain the sufficient number of keypoints. To accelerate the matching speed, the hierarchical matching method is adopted. Next, the TS filtering is applied to remove the preference set (PS) vectors that do not meet the condition. To estimate the good affine transformation, the PS vectors are clustered using the VCS clustering. Finally, the tampering location result is output. Through comparative experiments on several public uncompressed datasets, as well as datasets compressed by social media, it has been proven that the proposed method has good robustness in detecting social media images, outperforming the state-of-the-art methods.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"114 ","pages":"Article 104635"},"PeriodicalIF":3.1,"publicationDate":"2025-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145520917","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-06, DOI: 10.1016/j.jvcir.2025.104638
Yulin Sun , Guangming Shi , Weisheng Dong , Xuemei Xie
Multi-layer dictionary learning (MDL) has demonstrated significantly improved performance for image classification. However, most existing MDL methods adopt an overall shared dictionary learning architecture, which weakens the discrimination ability of the dictionaries. To address this, we propose a powerful framework called Multi-layer Graph Constraint Dictionary Pair Learning (MGDPL). MGDPL integrates multi-layer dictionary pair learning, a structure graph constraint, and discriminative sparse representations into a unified framework. First, the multi-layer structured dictionary learning mechanism is applied to dictionary pairs to enhance discrimination by having each layer rebuild the reconstruction error of the previous layer. Second, the structure graph constraint is imposed on the sub-sparse representations to ensure the discrimination capability of the nearest-neighbor graph. Third, the multi-layer discriminant graph regularization term ensures high intra-class tightness and inter-class dispersion of dictionary atoms in the reconstruction space. Extensive experiments show that MGDPL achieves excellent performance compared with other state-of-the-art methods.
{"title":"Multi-layer graph constraint dictionary pair learning for image classification","authors":"Yulin Sun , Guangming Shi , Weisheng Dong , Xuemei Xie","doi":"10.1016/j.jvcir.2025.104638","DOIUrl":"10.1016/j.jvcir.2025.104638","url":null,"abstract":"<div><div>Multi-layer dictionary learning (MDL) has demonstrated significantly improved performance for image classification. However, most of the existing MDL methods just overall shared dictionary learning architecture, which weakens the discrimination ability of the dictionaries. For this, we proposed a powerful framework called the Multi-layer Graph Constraint Dictionary Pair Learning (MGDPL). Our MGDPL integrates multi-layer dictionary pair learning, structure graph constraint, and discrimination sparse representations into a unified framework. First, the multi-layer structured dictionary learning mechanism is applied to dictionary pairs to enhance the discrimination performance by rebuilding the reconstruction error of the previous layer via the latter layer. Second, it subjects the structure graph constraint on the sub-sparse representations to ensure the discrimination capability of the near neighbor graph. Third, the multi-layer discriminant graph regularized constraint term can ensure high intra-class tightness and inter-class dispersion of dictionary atoms in reconstruction space. Extensive experiments show that MGDPL can achieve excellent performance over other state-of-the-arts.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"114 ","pages":"Article 104638"},"PeriodicalIF":3.1,"publicationDate":"2025-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145468801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-06, DOI: 10.1016/j.jvcir.2025.104636
Renhao Sun , Chaoqun Wang , Yujian Wang
In facial expression recognition, the uncertainties introduced by ambiguous facial expressions and the subjectiveness of annotators lead to inter-class similarity and intra-class diversity among annotated samples, which in turn degrades recognition results. To mitigate the performance loss caused by such uncertainties, we explore a Non-Parametric Uncertain Adaptive (NoPUA) method that suppresses ambiguous samples during training for facial expression recognition. Specifically, we first propose a self-paced feature bank module over mini-batches to compute the top-K similarity rank for each training sample, and then design a sample-to-class weighting score module based on this rank to grade the different categories according to the similarity classes of the samples themselves. Finally, we modify the labels of each uncertain sample using a self-adaptive relabeling module driven by the multi-category scores described above. Our method is non-parametric, easy to implement, and model-agnostic. Extensive experiments on three public benchmarks (RAF-DB, FERPlus, AffectNet) validate the effectiveness of NoPUA when embedded into a variety of algorithms (baseline, SCN, RUL, EAC, DAN, POSTER++), consistently improving their performance.
{"title":"Exploring a Non-Parametric Uncertain Adaptive training method for facial expression recognition","authors":"Renhao Sun , Chaoqun Wang , Yujian Wang","doi":"10.1016/j.jvcir.2025.104636","DOIUrl":"10.1016/j.jvcir.2025.104636","url":null,"abstract":"<div><div>In facial expression recognition, the uncertainties impregnated by ambiguous facial expressions and subjectiveness of annotators lead to inter-class similarity and intra-class diversity among annotated samples, which in turn leads to deterioration of recognition results. To mitigate bad performance due to uncertainties, we explore a <strong>No</strong>n-<strong>P</strong>arametric <strong>U</strong>ncertain <strong>A</strong>daptive (NoPUA) method during the training process to suppress ambiguous samples for facial expression recognition. Specifically, we first propose a <em>self-paced feature bank module</em> on mini-batches to calculate the top-<em>K</em> similarity rank for each training sample, and then design a <em>sample-to-class weighting score module</em> based on the similarity rank to grade the different categories with respect to the similarity classes of the samples themselves. Finally, we modify the labels of each uncertain sample using the self-adaptive relabeling module for multi-category scoring described above. Our method is non-parametric and easy to implement. Moreover, it is model-agnostic. Extensive experiments on three public benchmarks (RAF-DB, FERPlus, AffectNet) validate the effectiveness of our NoPUA embedded in a variety of algorithms (baseline, SCN, RUL, EAC, DAN, POSTER++) and achieve better performance.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"114 ","pages":"Article 104636"},"PeriodicalIF":3.1,"publicationDate":"2025-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145520899","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-06, DOI: 10.1016/j.jvcir.2025.104629
Jose N. Filipe , Luis M.N. Tavora , Sergio M.M. Faria , Antonio Navarro , Pedro A.A. Assuncao
The rising demand for UHD and 360° content has driven the creation of advanced compression tools with enhanced coding efficiency. Versatile Video Coding (VVC) improves coding efficiency over previous standards, but introduces significantly higher computational complexity. To address this, this paper presents a novel intra-coding method for 360° video in Equirectangular Projection (ERP) format that reduces complexity with minimal impact on coding efficiency. It shows that the North, Equator, and South regions of ERP images exhibit distinct complexity and spatial characteristics. A region-based approach uses multiple Gradient Boosted Trees models for each region to determine whether a partition type can be skipped. Additionally, an adaptive decision threshold scheme is introduced to optimise vertical partitioning in the polar regions. The paper also presents an optimisation solution for the complexity/BD-Rate loss trade-off parameters. Experimental results demonstrate a 50% reduction in complexity with only a 0.37% BD-Rate loss, outperforming current state-of-the-art methods.
{"title":"Fast adaptive QTMT partitioning for intra 360°video coding based on gradient boosted trees","authors":"Jose N. Filipe , Luis M.N. Tavora , Sergio M.M. Faria , Antonio Navarro , Pedro A.A. Assuncao","doi":"10.1016/j.jvcir.2025.104629","DOIUrl":"10.1016/j.jvcir.2025.104629","url":null,"abstract":"<div><div>The rising demand for UHD and 360°content has driven the creation of advanced compression tools with enhanced coding efficiency. Versatile Video Coding (VVC) has recently improved coding efficiency over previous standards, but introduces significantly higher computational complexity. To address this, this paper presents a novel intra-coding method for 360°video in Equirectangular Projection (ERP) format that reduces complexity with minimal impact on coding efficiency. It shows that the North, Equator, and South regions of ERP images exhibit distinct complexity and spatial characteristics. A region-based approach uses multiple Gradient Boosted Trees models for each region to determine if a partition type can be skipped. Additionally, an adaptive decision threshold scheme is introduced to optimise vertical partitioning in polar regions. The paper also presents an optimisation solution for the Complexity/BD-Rate loss trade-off parameters. Experimental results demonstrate a 50% complexity gain with only a 0.37% BD-Rate loss, outperforming current state-of-the-art methods.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"114 ","pages":"Article 104629"},"PeriodicalIF":3.1,"publicationDate":"2025-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145520898","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-04, DOI: 10.1016/j.jvcir.2025.104634
Fan Ye, Li Li, Dong Liu
Learned image compression (LIC) achieves superior rate–distortion performance over traditional codecs but faces deployment challenges due to floating-point inconsistencies and high computational cost. Existing quantized LIC models are typically single-rate and lack support for variable-rate compression, limiting their adaptability. We propose a fully quantized variable-rate LIC framework that enables integer-only inference across all components. Our method introduces bitrate-specific quantization parameters to address rate-dependent activation variations. All computations — including weights, biases, activations, and nonlinearities — are performed using 8-bit integer operations such as multiplications, bit-shifts, and lookup tables. To further enhance hardware efficiency, we adopt per-layer quantization and reduce intermediate precision from 32-bit to 16-bit. Experiments show that our fully 8-bit quantized model reduces bitrate by 19.2% compared to VTM-17.2 intra coding on standard test sets. It also achieves 50.5% and 52.2% speedup in encoding and decoding, respectively, over its floating-point counterpart.
{"title":"Variable-rate learned image compression with integer-arithmetic-only inference","authors":"Fan Ye, Li Li, Dong Liu","doi":"10.1016/j.jvcir.2025.104634","DOIUrl":"10.1016/j.jvcir.2025.104634","url":null,"abstract":"<div><div>Learned image compression (LIC) achieves superior rate–distortion performance over traditional codecs but faces deployment challenges due to floating-point inconsistencies and high computational cost. Existing quantized LIC models are typically single-rate and lack support for variable-rate compression, limiting their adaptability. We propose a fully quantized variable-rate LIC framework that enables integer-only inference across all components. Our method introduces bitrate-specific quantization parameters to address rate-dependent activation variations. All computations — including weights, biases, activations, and nonlinearities — are performed using 8-bit integer operations such as multiplications, bit-shifts, and lookup tables. To further enhance hardware efficiency, we adopt per-layer quantization and reduce intermediate precision from 32-bit to 16-bit. Experiments show that our fully 8-bit quantized model reduces bitrate by 19.2% compared to VTM-17.2 intra coding on standard test sets. It also achieves 50.5% and 52.2% speedup in encoding and decoding, respectively, over its floating-point counterpart.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"113 ","pages":"Article 104634"},"PeriodicalIF":3.1,"publicationDate":"2025-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145466975","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-04, DOI: 10.1016/j.jvcir.2025.104633
Thuan Minh Nguyen , Khoi Anh Bui , Myungsik Yoo
For hyperspectral image (HSI) classification, convolutional neural networks with a local kernel neglect the global HSI properties, and transformer networks often predict only the central pixel. This study proposes a spatial–spectral UnetFormer network to extract the full local and global spatial similarities and the long- and short-range spectral dependencies for HSI classification. This approach fuses a spectral transformer subnetwork and a spatial attention U-net subnetwork to create outputs. In the spectral subnetwork, the transformer is tailored at the embedding and head layers to generate a prediction for all input pixels. In the spatial attention U-net subnetwork, a local–global spatial feature model is introduced based on the U-net structure with a singular value decomposition-aided spatial self-attention module to emphasize useful details, mitigate the impact of noise, and eventually learn the global spatial features. The proposed model obtains results competitive with state-of-the-art methods in HSI classification on various public datasets.
{"title":"SSUFormer: Spatial–spectral UnetFormer for improving hyperspectral image classification","authors":"Thuan Minh Nguyen , Khoi Anh Bui , Myungsik Yoo","doi":"10.1016/j.jvcir.2025.104633","DOIUrl":"10.1016/j.jvcir.2025.104633","url":null,"abstract":"<div><div>For hyperspectral image (HSI) classification, convolutional neural networks with a local kernel neglect the global HSI properties, and transformer networks often predict only the central pixel. This study proposes a spatial–spectral UnetFormer network to extract the full local and global spatial similarities and the long short-range spectral dependencies for HSI classification. This approach fuses a spectral transformer subnetwork and a spatial attention U-net subnetwork to create outputs. In the spectral subnetwork, the transformer is tailored at the embedding and head layers to generate a prediction for all input pixels. In the spatial attention U-net subnetwork, a local–global spatial feature model is introduced based on the U-net structure with a singular value decomposition-aided spatial self-attention module to emphasize useful details, mitigate the impact of noise, and eventually learn the global spatial features. The proposed model obtains competitive results with state-of-the-art methods in HSI classification on various public datasets.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"114 ","pages":"Article 104633"},"PeriodicalIF":3.1,"publicationDate":"2025-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145468765","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-03, DOI: 10.1016/j.jvcir.2025.104632
Mayank Sah, Saurya Suman, Jimson Mathew
Accurate food recognition and calorie estimation are critical for managing diet-related health issues such as obesity and diabetes. Traditional food logging methods rely on manual input, leading to inaccurate nutritional records. Although recent advances in computer vision and deep learning offer automated solutions, existing models struggle to generalize due to homogeneous datasets and limited representation of complex cuisines such as Indian food. This paper introduces a dataset containing over 15,000 images of 56 popular Indian food items. Curated from diverse sources, including social media and real-world photography, the dataset captures the complexity of Indian meals, where multiple food items often appear together in a single image, and offers greater variability in lighting, presentation, and image quality than existing datasets. We evaluated the dataset with various YOLO-based models, from YOLOv5 through YOLOv12, and enhanced the backbone with omni-scale feature learning from OSNet, improving detection accuracy. In addition, we integrate a Retrieval-Augmented Generation (RAG) module with YOLO, which refines food identification by associating fine-grained food categories with nutritional information, ingredients, and recipes. Our approach demonstrates improved performance in recognizing complex meals and addresses key challenges in food recognition, offering a scalable solution for accurate calorie estimation, especially for culturally diverse cuisines like Indian food.
{"title":"Retrieval augmented generation for smart calorie estimation in complex food scenarios","authors":"Mayank Sah, Saurya Suman, Jimson Mathew","doi":"10.1016/j.jvcir.2025.104632","DOIUrl":"10.1016/j.jvcir.2025.104632","url":null,"abstract":"<div><div>Accurate food recognition and calorie estimation are critical for managing diet-related health issues such as obesity and diabetes. Traditional food logging methods rely on manual input, leading to inaccurate nutritional records. Although recent advances in computer vision and deep learning offer automated solutions, existing models struggle with generalizability due to homogeneous datasets and limited representation of complex cuisines like Indian food. This paper introduces a dataset containing over 15,000 images of 56 popular Indian food items. Curated from diverse sources, including social media and real-world photography, the dataset aims to capture the complexity of Indian meals, where multiple food items often appear together in a single image. This ensures greater lighting, presentation, and image quality variability compared to existing data sets. We evaluated the data set with various YOLO-based models, including YOLOv5 through YOLOv12, and enhanced the backbone with omniscale feature learning from OSNet, improving detection accuracy. In addition, we integrate a Retrieval-Augmented-Generation (RAG) module with YOLO, which refines food identification by associating fine-grained food categories with nutritional information, ingredients, and recipes. Our approach demonstrates improved performance in recognizing complex meals. It addresses key challenges in food recognition, offering a scalable solution for accurate calorie estimation, especially for culturally diverse cuisines like Indian food.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"113 ","pages":"Article 104632"},"PeriodicalIF":3.1,"publicationDate":"2025-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145466971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}