Pub Date : 2024-09-05DOI: 10.1109/TIP.2024.3445738
Liang Liao, Kangmin Xu, Haoning Wu, Chaofeng Chen, Wenxiu Sun, Qiong Yan, C-C Jay Kuo, Weisi Lin
Blind video quality assessment (VQA) has become an increasingly demanding problem in automatically assessing the quality of ever-growing in-the-wild videos. Although efforts have been made to measure temporal distortions, the core to distinguish between VQA and image quality assessment (IQA), the lack of modeling of how the human visual system (HVS) relates to the temporal quality of videos hinders the precise mapping of predicted temporal scores to the human perception. Inspired by the recent discovery of the temporal straightness law of natural videos in the HVS, this paper intends to model the complex temporal distortions of in-the-wild videos in a simple and uniform representation by describing the geometric properties of videos in the visual perceptual domain. A novel videolet, with perceptual representation embedding of a few consecutive frames, is designed as the basic quality measurement unit to quantify temporal distortions by measuring the angular and linear displacements from the straightness law. By combining the predicted score on each videolet, a perceptually temporal quality evaluator (PTQE) is formed to measure the temporal quality of the entire video. Experimental results demonstrate that the perceptual representation in the HVS is an efficient way of predicting subjective temporal quality. Moreover, when combined with spatial quality metrics, PTQE achieves top performance over popular in-the-wild video datasets. More importantly, PTQE requires no additional information beyond the video being assessed, making it applicable to any dataset without parameter tuning. Additionally, the generalizability of PTQE is evaluated on video frame interpolation tasks, demonstrating its potential to benefit temporal-related enhancement tasks.
{"title":"Blind Video Quality Prediction by Uncovering Human Video Perceptual Representation.","authors":"Liang Liao, Kangmin Xu, Haoning Wu, Chaofeng Chen, Wenxiu Sun, Qiong Yan, C-C Jay Kuo, Weisi Lin","doi":"10.1109/TIP.2024.3445738","DOIUrl":"https://doi.org/10.1109/TIP.2024.3445738","url":null,"abstract":"<p><p>Blind video quality assessment (VQA) has become an increasingly demanding problem in automatically assessing the quality of ever-growing in-the-wild videos. Although efforts have been made to measure temporal distortions, the core to distinguish between VQA and image quality assessment (IQA), the lack of modeling of how the human visual system (HVS) relates to the temporal quality of videos hinders the precise mapping of predicted temporal scores to the human perception. Inspired by the recent discovery of the temporal straightness law of natural videos in the HVS, this paper intends to model the complex temporal distortions of in-the-wild videos in a simple and uniform representation by describing the geometric properties of videos in the visual perceptual domain. A novel videolet, with perceptual representation embedding of a few consecutive frames, is designed as the basic quality measurement unit to quantify temporal distortions by measuring the angular and linear displacements from the straightness law. By combining the predicted score on each videolet, a perceptually temporal quality evaluator (PTQE) is formed to measure the temporal quality of the entire video. Experimental results demonstrate that the perceptual representation in the HVS is an efficient way of predicting subjective temporal quality. Moreover, when combined with spatial quality metrics, PTQE achieves top performance over popular in-the-wild video datasets. More importantly, PTQE requires no additional information beyond the video being assessed, making it applicable to any dataset without parameter tuning. Additionally, the generalizability of PTQE is evaluated on video frame interpolation tasks, demonstrating its potential to benefit temporal-related enhancement tasks.</p>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142142132","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-09-05DOI: 10.1109/TIP.2024.3451931
Yapeng Li, Yong Luo, Bo Du
Existing multi-view classification algorithms usually assume that all examples have observations on all views, and the data in different views are clean. However, in real-world applications, we are often provided with data that have missing representations or contain noise on some views (i.e., missing or noise views). This may lead to significant performance degeneration, and thus many algorithms are proposed to address the incomplete view or noisy view issues. However, most of existing algorithms deal with the two issues separately, and hence may fail when both missing and noisy views exist. They are also usually not flexible in that the view or feature significance cannot be adaptively identified. Besides, the view missing patterns may vary in the training and test phases, and such difference is often ignored. To remedy these drawbacks, we propose a novel multi-view classification framework that is simultaneously robust to both incomplete and noisy views. This is achieved by integrating early fusion and late fusion in a single framework. Specifically, in our early fusion module, we propose a view-aware transformer to mask the missing views and adaptively explore the relationships between views and target tasks to deal with missing views. Considering that view missing patterns may change from the training to the test phase, we also design single-view classification and category-consistency constraints to reduce the dependence of our model on view-missing patterns. In our late fusion module, we quantify the view uncertainty in an ensemble way to estimate the noise level of that view. Then the uncertainty and prediction logits of different views are integrated to make our model robust to noisy views. The framework is trained in an end-to-end manner. Experimental results on diverse datasets demonstrate the robustness and effectiveness of our model for both incomplete and noisy views. Codes are available at https://github.com/li-yapeng/UVaT.
{"title":"UVaT: Uncertainty Incorporated View-aware Transformer for Robust Multi-view Classification.","authors":"Yapeng Li, Yong Luo, Bo Du","doi":"10.1109/TIP.2024.3451931","DOIUrl":"https://doi.org/10.1109/TIP.2024.3451931","url":null,"abstract":"<p><p>Existing multi-view classification algorithms usually assume that all examples have observations on all views, and the data in different views are clean. However, in real-world applications, we are often provided with data that have missing representations or contain noise on some views (i.e., missing or noise views). This may lead to significant performance degeneration, and thus many algorithms are proposed to address the incomplete view or noisy view issues. However, most of existing algorithms deal with the two issues separately, and hence may fail when both missing and noisy views exist. They are also usually not flexible in that the view or feature significance cannot be adaptively identified. Besides, the view missing patterns may vary in the training and test phases, and such difference is often ignored. To remedy these drawbacks, we propose a novel multi-view classification framework that is simultaneously robust to both incomplete and noisy views. This is achieved by integrating early fusion and late fusion in a single framework. Specifically, in our early fusion module, we propose a view-aware transformer to mask the missing views and adaptively explore the relationships between views and target tasks to deal with missing views. Considering that view missing patterns may change from the training to the test phase, we also design single-view classification and category-consistency constraints to reduce the dependence of our model on view-missing patterns. In our late fusion module, we quantify the view uncertainty in an ensemble way to estimate the noise level of that view. Then the uncertainty and prediction logits of different views are integrated to make our model robust to noisy views. The framework is trained in an end-to-end manner. Experimental results on diverse datasets demonstrate the robustness and effectiveness of our model for both incomplete and noisy views. Codes are available at https://github.com/li-yapeng/UVaT.</p>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142142174","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-09-05DOI: 10.1109/TIP.2024.3451936
Tongxue Zhou
Accurate segmentation of brain tumors across multiple MRI sequences is essential for diagnosis, treatment planning, and clinical decision-making. In this paper, I propose a cutting-edge framework, named multi-modal graph convolution network (M2GCNet), to explore the relationships across different MR modalities, and address the challenge of brain tumor segmentation. The core of M2GCNet is the multi-modal graph convolution module (M2GCM), a pivotal component that represents MR modalities as graphs, with nodes corresponding to image pixels and edges capturing latent relationships between pixels. This graph-based representation enables the effective utilization of both local and global contextual information. Notably, M2GCM comprises two important modules: the spatial-wise graph convolution module (SGCM), adept at capturing extensive spatial dependencies among distinct regions within an image, and the channel-wise graph convolution module (CGCM), dedicated to modelling intricate contextual dependencies among different channels within the image. Additionally, acknowledging the intrinsic correlation present among different MR modalities, a multi-modal correlation loss function is introduced. This novel loss function aims to capture specific nonlinear relationships between correlated modality pairs, enhancing the model’s ability to achieve accurate segmentation results. The experimental evaluation on two brain tumor datasets demonstrates the superiority of the proposed M2GCNet over other state-of-the-art segmentation methods. Furthermore, the proposed method paves the way for improved tumor diagnosis, multi-modal information fusion, and a deeper understanding of brain tumor pathology.
{"title":"M2GCNet: Multi-Modal Graph Convolution Network for Precise Brain Tumor Segmentation Across Multiple MRI Sequences","authors":"Tongxue Zhou","doi":"10.1109/TIP.2024.3451936","DOIUrl":"10.1109/TIP.2024.3451936","url":null,"abstract":"Accurate segmentation of brain tumors across multiple MRI sequences is essential for diagnosis, treatment planning, and clinical decision-making. In this paper, I propose a cutting-edge framework, named multi-modal graph convolution network (M2GCNet), to explore the relationships across different MR modalities, and address the challenge of brain tumor segmentation. The core of M2GCNet is the multi-modal graph convolution module (M2GCM), a pivotal component that represents MR modalities as graphs, with nodes corresponding to image pixels and edges capturing latent relationships between pixels. This graph-based representation enables the effective utilization of both local and global contextual information. Notably, M2GCM comprises two important modules: the spatial-wise graph convolution module (SGCM), adept at capturing extensive spatial dependencies among distinct regions within an image, and the channel-wise graph convolution module (CGCM), dedicated to modelling intricate contextual dependencies among different channels within the image. Additionally, acknowledging the intrinsic correlation present among different MR modalities, a multi-modal correlation loss function is introduced. This novel loss function aims to capture specific nonlinear relationships between correlated modality pairs, enhancing the model’s ability to achieve accurate segmentation results. The experimental evaluation on two brain tumor datasets demonstrates the superiority of the proposed M2GCNet over other state-of-the-art segmentation methods. Furthermore, the proposed method paves the way for improved tumor diagnosis, multi-modal information fusion, and a deeper understanding of brain tumor pathology.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142142168","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-09-05DOI: 10.1109/TIP.2024.3451938
Bardia Azizian;Ivan V. Bajić
Privacy is a crucial concern in collaborative machine vision where a part of a Deep Neural Network (DNN) model runs on the edge, and the rest is executed on the cloud. In such applications, the machine vision model does not need the exact visual content to perform its task. Taking advantage of this potential, private information could be removed from the data insofar as it does not significantly impair the accuracy of the machine vision system. In this paper, we present an autoencoder-style network integrated within an object detection pipeline, which generates a latent representation of the input image that preserves task-relevant information while removing private information. Our approach employs an adversarial training strategy that not only removes private information from the bottleneck of the autoencoder but also promotes improved compression efficiency for feature channels coded by conventional codecs like VVC-Intra. We assess the proposed system using a realistic evaluation framework for privacy, directly measuring face and license plate recognition accuracy. Experimental results show that our proposed method is able to reduce the bitrate significantly at the same object detection accuracy compared to coding the input images directly, while keeping the face and license plate recognition accuracy on the images recovered from the bottleneck features low, implying strong privacy protection. Our code is available at https://github.com/bardia-az/ppa-code