Cascaded Adaptive Graph Representation Learning for Image Copy-Move Forgery Detection
Yuanman Li, Lanhao Ye, Haokun Cao, Wei Wang, Zhongyun Hua
In the realm of image security, there has been a burgeoning interest in harnessing deep learning techniques for the detection of digital image copy-move forgeries, resulting in promising outcomes. The generation process of such forgeries results in a distinctive topological structure among patches, and collaborative modeling based on these underlying topologies proves instrumental in enhancing the discrimination of ambiguous pixels. Despite the attention received, existing deep learning models predominantly rely on convolutional neural networks (CNNs), falling short in adequately capturing correlations among distant patches. This limitation impedes the seamless propagation of information and collaborative learning across related patches. To address this gap, our work introduces an innovative framework for image copy-move forensics rooted in graph representation learning. Initially, we introduce an adaptive graph learning approach to foster collaboration among related patches, dynamically learning the inherent topology of patches. The devised approach excels in promoting efficient information flow among related patches, encompassing both short-range and long-range correlations. Additionally, we formulate a cascaded graph learning framework, progressively refining patch representations and disseminating information to broader correlated patches based on their updated topologies. Finally, we propose a hierarchical cross-attention mechanism facilitating the exchange of information between the cascaded graph learning branch and a dedicated forgery detection branch. This equips our method with the capability to jointly grasp the homology of copy-move correspondences and identify inconsistencies between the target region and the background. Comprehensive experimental results validate the superiority of our proposed scheme, providing a robust solution to security challenges posed by digital image manipulations.
ACM Transactions on Multimedia Computing Communications and Applications, published 2024-05-29. DOI: https://doi.org/10.1145/3669905
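As a rough illustration of the patch-graph idea described in the abstract, the sketch below (not the authors' implementation) builds a k-nearest-neighbour adjacency from the cosine similarity of patch features and performs one round of message passing; the feature dimension, the choice of k, and the random inputs are purely illustrative assumptions. Cascaded graph learning would amount to repeating such a step with the adjacency re-estimated from the updated features.

```python
import numpy as np

def adaptive_graph_step(patch_feats, k=4):
    """One round of similarity-driven message passing over image patches.

    patch_feats: (N, D) array of patch descriptors (hypothetical; any
    backbone feature would do in practice).
    """
    # Cosine similarity between every pair of patches.
    normed = patch_feats / (np.linalg.norm(patch_feats, axis=1, keepdims=True) + 1e-8)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)            # ignore self-similarity

    # Keep only the k strongest neighbours per patch: a sparse, data-driven topology.
    adj = np.zeros_like(sim)
    topk = np.argsort(sim, axis=1)[:, -k:]
    rows = np.arange(sim.shape[0])[:, None]
    adj[rows, topk] = sim[rows, topk]

    # Row-normalise and aggregate neighbour features (one propagation step).
    adj = np.maximum(adj, 0)
    adj /= adj.sum(axis=1, keepdims=True) + 1e-8
    return adj @ patch_feats                  # updated patch representations

feats = np.random.rand(64, 128)               # 64 patches, 128-D features (placeholder)
updated = adaptive_graph_step(feats)
print(updated.shape)                          # (64, 128)
```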
Skeleton-aware Graph-based Adversarial Networks for Human Pose Estimation from Sparse IMUs
Kaixin Chen, Lin Zhang, Zhong Wang, Shengjie Zhao, Yicong Zhou
Recently, sparse-inertial human pose estimation (SI-HPE) with only a few IMUs has shown great potential in various fields. The most advanced work in this area achieved fairly good results using only six IMUs. However, two major issues remain to be addressed. First, existing methods typically treat SI-HPE as a temporal sequence learning problem and often ignore the important spatial prior of skeletal topology. Second, their training data contain far more synthetic samples than real ones, and the two differ considerably in distribution, which makes it difficult to apply the models to more diverse real data. To address these issues, we propose the “Graph-based Adversarial Inertial Poser (GAIP)”, which tracks body movements using sparse data from six IMUs. To make full use of the spatial prior, we design a multi-stage pose regressor with graph convolution to explicitly learn the skeletal topology. A joint position loss is also introduced to implicitly mine spatial information. To enhance generalization ability, we propose supervising the pose regression with an adversarial loss from a discriminator, fully exploiting the ability of adversarial networks to learn implicit constraints. Additionally, we construct a real dataset that includes hip support movements and a synthetic dataset containing various motion categories to enrich the diversity of inertial data for SI-HPE. Extensive experiments demonstrate that GAIP produces results with more precise limb movement amplitudes and relative joint positions, along with smaller joint angle and position errors, compared to state-of-the-art counterparts. The datasets and code are publicly available at https://cslinzhang.github.io/GAIP/.
ACM Transactions on Multimedia Computing Communications and Applications, published 2024-05-29. DOI: https://doi.org/10.1145/3669904
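To illustrate the skeletal prior mentioned in the abstract, here is a minimal numpy sketch of a single graph-convolution layer over a toy 5-joint skeleton; the joint count, adjacency, feature sizes, and random weights are assumptions for illustration and do not reflect GAIP's actual multi-stage regressor.

```python
import numpy as np

# Toy 5-joint kinematic chain: root-spine, spine-head, root-left hip, root-right hip.
edges = [(0, 1), (1, 2), (0, 3), (0, 4)]
num_joints = 5

A = np.eye(num_joints)                        # self-loops
for i, j in edges:                            # undirected skeletal edges
    A[i, j] = A[j, i] = 1.0

# Symmetric normalisation D^{-1/2} A D^{-1/2} used by standard graph convolutions.
d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
A_hat = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def graph_conv(joint_feats, weight):
    """One graph-convolution layer over the skeleton: aggregate neighbouring
    joints, then apply a shared linear map followed by ReLU."""
    return np.maximum(A_hat @ joint_feats @ weight, 0.0)

rng = np.random.default_rng(0)
feats = rng.normal(size=(num_joints, 12))     # e.g. per-joint IMU-derived features (placeholder)
W = rng.normal(size=(12, 12))
print(graph_conv(feats, W).shape)             # (5, 12)
```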
Parents and Children: Distinguishing Multimodal DeepFakes from Natural Images
Roberto Amoroso, Davide Morelli, Marcella Cornia, Lorenzo Baraldi, Alberto Del Bimbo, Rita Cucchiara
Recent advancements in diffusion models have enabled the generation of realistic deepfakes from textual prompts in natural language. While these models have numerous benefits across various sectors, they have also raised concerns about the potential misuse of fake images and placed new pressure on fake image detection. In this work, we pioneer a systematic study of the detection of deepfakes generated by state-of-the-art diffusion models. Firstly, we conduct a comprehensive analysis of the performance of contrastive and classification-based visual features, respectively extracted from CLIP-based models and from ResNet- or ViT-based architectures trained on image classification datasets. Our results demonstrate that fake images share common low-level cues, which render them easily recognizable. Further, we devise a multimodal setting wherein fake images are synthesized from different textual captions, which are used as seeds for a generator. Under this setting, we quantify the performance of fake detection strategies and introduce a contrastive-based disentangling method that lets us analyze the role of the semantics of textual descriptions and of low-level perceptual cues. Finally, we release a new dataset, called COCOFake, containing about 1.2M images generated from the original COCO image-caption pairs using two recent text-to-image diffusion models, namely Stable Diffusion v1.4 and v2.0.
ACM Transactions on Multimedia Computing Communications and Applications, published 2024-05-21. DOI: https://doi.org/10.1145/3665497
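A minimal sketch of the kind of linear-probe evaluation such a study performs on frozen visual features: features from a CLIP or ImageNet-trained backbone are fed to a simple classifier. The arrays below are random placeholders for precomputed features, so the printed accuracy is meaningless; only the probing recipe is illustrated.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Stand-ins for features extracted from a frozen backbone (e.g. a CLIP image
# encoder or an ImageNet-trained ResNet); here they are random placeholders.
rng = np.random.default_rng(0)
real_feats = rng.normal(loc=0.0, size=(500, 512))
fake_feats = rng.normal(loc=0.3, size=(500, 512))   # pretend low-level cues shift the distribution

X = np.vstack([real_feats, fake_feats])
y = np.concatenate([np.zeros(500), np.ones(500)])    # 0 = real, 1 = fake

# A linear probe is often enough when the frozen features already separate the classes.
idx = rng.permutation(len(y))
split = int(0.8 * len(y))
train, test = idx[:split], idx[split:]

clf = LogisticRegression(max_iter=1000).fit(X[train], y[train])
print("linear-probe accuracy:", accuracy_score(y[test], clf.predict(X[test])))
```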
Spatiotemporal Inconsistency Learning and Interactive Fusion for Deepfake Video Detection
Dengyong Zhang, Wenjie Zhu, Xin Liao, Feifan Qi, Gaobo Yang, Xiangling Ding
The rapid advancement of deepfake technology has become closely intertwined with the rise of the metaverse. Within the metaverse, individuals exist in digital form and engage in interactions, transactions, and communications through virtual avatars. However, the development of deepfake technology has led to the proliferation of forged information disseminated under the guise of users’ virtual identities, posing significant security risks to the metaverse. Hence, there is an urgent need to research and develop more robust methods for detecting deepfakes to address these challenges. This paper explores deepfake video detection by leveraging the spatiotemporal inconsistencies introduced by deepfake generation techniques, and proposes the spatiotemporal inconsistency learning and interactive fusion (ST-ILIF) detection method, which consists of a phase-aware stream and a sequence stream. The spatial inconsistencies exhibited in frames of deepfake videos are primarily attributed to variations in the structural information contained within the phase component of the Fourier domain. To mitigate overfitting to content information, a phase-aware stream is introduced to learn spatial inconsistencies from phase-based reconstructed frames. Additionally, considering that deepfake videos are generated frame by frame and lack temporal consistency between frames, a sequence stream is proposed to extract temporal inconsistency features from the spatiotemporal difference information between consecutive frames. Finally, through feature interaction and fusion of the two streams, the representation ability of the intermediate and classification features is further enhanced. The proposed method, evaluated on four mainstream datasets, outperformed most existing methods, and extensive experimental results demonstrate its effectiveness in identifying deepfake videos. Our source code is available at https://github.com/qff98/Deepfake-Video-Detection
ACM Transactions on Multimedia Computing Communications and Applications, published 2024-05-13. DOI: https://doi.org/10.1145/3664654
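The phase-aware stream operates on phase-based reconstructed frames. A common way to obtain such a reconstruction, shown below as a hedged numpy sketch (the paper's exact recipe may differ), is to keep only the phase of the 2-D Fourier transform and discard the magnitude, which preserves structural layout while suppressing most appearance content.

```python
import numpy as np

def phase_only_reconstruction(frame):
    """Rebuild a frame from the phase of its 2-D Fourier transform only."""
    spectrum = np.fft.fft2(frame)
    phase = np.angle(spectrum)
    recon = np.fft.ifft2(np.exp(1j * phase)).real    # unit magnitude, original phase
    # Normalise to [0, 1] for visualisation / downstream use.
    recon -= recon.min()
    return recon / (recon.max() + 1e-8)

frame = np.random.rand(128, 128)                      # stand-in for a grayscale video frame
print(phase_only_reconstruction(frame).shape)         # (128, 128)
```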
From CNNs to Transformers in Multimodal Human Action Recognition: A Survey
Muhammad Bilal Shaikh, Douglas Chai, Syed Muhammad Shamsul Islam, Naveed Akhtar
Due to its widespread applications, human action recognition is one of the most widely studied research problems in Computer Vision. Recent studies have shown that addressing it with multimodal data leads to superior performance compared to relying on a single data modality. During the adoption of deep learning for visual modelling over the last decade, action recognition approaches have mainly relied on Convolutional Neural Networks (CNNs). However, the recent rise of Transformers in visual modelling is now also causing a paradigm shift for the action recognition task. This survey captures this transition while focusing on Multimodal Human Action Recognition (MHAR). Unique to building multimodal computational models is the process of ‘fusing’ the features of the individual data modalities. Hence, we specifically focus on the fusion design aspects of MHAR approaches. We analyze the classic and emerging techniques in this regard, while also highlighting popular trends in the adoption of CNN and Transformer building blocks for the overall problem. In particular, we emphasize recent design choices that have led to more efficient MHAR models. Unlike existing reviews, which discuss human action recognition from a broad perspective, this survey is specifically aimed at pushing the boundaries of MHAR research by identifying promising architectural and fusion design choices for training practicable models. We also provide an outlook on multimodal datasets from the viewpoint of their scale and evaluation. Finally, building on the reviewed literature, we discuss the challenges and future avenues for MHAR.
ACM Transactions on Multimedia Computing Communications and Applications, published 2024-05-13. DOI: https://doi.org/10.1145/3664815
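For readers new to fusion design, the sketch below contrasts the two simplest points on the design spectrum such a survey covers, feature-level (early) and decision-level (late) fusion; the modality names, feature sizes, and the ten hypothetical action classes are illustrative assumptions.

```python
import numpy as np

def late_fusion(rgb_logits, imu_logits, w=0.5):
    """Decision-level fusion: weighted average of per-modality class scores."""
    return w * rgb_logits + (1.0 - w) * imu_logits

def early_fusion(rgb_feat, imu_feat, weight):
    """Feature-level fusion: concatenate modality features, then classify."""
    fused = np.concatenate([rgb_feat, imu_feat], axis=-1)
    return fused @ weight                      # a single linear classification head

rng = np.random.default_rng(0)
rgb_feat, imu_feat = rng.normal(size=(1, 256)), rng.normal(size=(1, 64))
W = rng.normal(size=(256 + 64, 10))            # 10 hypothetical action classes
print(early_fusion(rgb_feat, imu_feat, W).shape)                              # (1, 10)
print(late_fusion(rng.normal(size=(1, 10)), rng.normal(size=(1, 10))).shape)  # (1, 10)
```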
HKA: A Hierarchical Knowledge Alignment Framework for Multimodal Knowledge Graph Completion
Yunhui Xu, Youru Li, Muhao Xu, Zhenfeng Zhu, Yao Zhao
Recent years have witnessed the successful application of knowledge graph techniques to structured data processing, whereas how to incorporate knowledge from visual and textual modalities into knowledge graphs has received less attention. To better organize such knowledge, Multimodal Knowledge Graphs (MKGs), comprising the structural triplets of traditional Knowledge Graphs (KGs) together with entity-related multimodal data (e.g., images and texts), have been successively introduced. However, exploring MKGs remains a great challenge due to their inherent incompleteness. Although most existing Multimodal Knowledge Graph Completion (MKGC) approaches can infer missing triplets based on available factual triplets and multimodal information, they largely ignore modal conflicts and the supervisory effect, failing to achieve a more comprehensive understanding of entities. To address these issues, we propose a novel Hierarchical Knowledge Alignment (HKA) framework for MKGC. Specifically, a macro-knowledge alignment module is proposed to capture the global semantic relevance between modalities and deal with modal conflicts in the MKG. Furthermore, a micro-knowledge alignment module is developed to reveal local consistency information more effectively through inter- and intra-modality supervisory effects. By integrating the predictions of the different modalities, a final decision can be made. Experimental results on three benchmark MKGC tasks demonstrate the effectiveness of the proposed HKA framework.
ACM Transactions on Multimedia Computing Communications and Applications, published 2024-05-11. DOI: https://doi.org/10.1145/3664288
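A generic cross-modal alignment objective in the spirit of the macro-knowledge alignment module, written as a symmetric InfoNCE-style loss between two views of the same entities; this is a hedged sketch of the general technique, not the paper's exact formulation, and the embeddings below are random placeholders.

```python
import numpy as np

def contrastive_alignment_loss(emb_a, emb_b, temperature=0.1):
    """Symmetric InfoNCE-style loss: pull together embeddings of the same
    entity from two modalities, push apart mismatched pairs.

    emb_a, emb_b: (N, D) L2-normalised embeddings, row i of each describing
    the same entity (e.g. structural vs. visual view).
    """
    logits = emb_a @ emb_b.T / temperature            # (N, N) similarity matrix
    labels = np.arange(len(emb_a))                    # matching pairs lie on the diagonal

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)          # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_prob[np.arange(len(y)), y].mean()

    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

rng = np.random.default_rng(0)
a = rng.normal(size=(8, 64)); a /= np.linalg.norm(a, axis=1, keepdims=True)
b = rng.normal(size=(8, 64)); b /= np.linalg.norm(b, axis=1, keepdims=True)
print(contrastive_alignment_loss(a, b))
```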
Blind Quality Assessment of Dense 3D Point Clouds with Structure Guided Resampling
Wei Zhou, Qi Yang, Wu Chen, Qiuping Jiang, Guangtao Zhai, Weisi Lin
Objective quality assessment of 3D point clouds is essential for the development of immersive multimedia systems in real-world applications. Despite the success of perceptual quality evaluation for 2D images and videos, blind/no-reference metrics are still scarce for 3D point clouds with large-scale, irregularly distributed 3D points. Therefore, in this paper, we propose an objective point cloud quality index with Structure Guided Resampling (SGR) to automatically evaluate the perceptual visual quality of dense 3D point clouds. The proposed SGR is a general-purpose blind quality assessment method that requires no reference information. Specifically, considering that the human visual system (HVS) is highly sensitive to structure information, we first exploit the unique normal vectors of point clouds to perform regional pre-processing, which consists of keypoint resampling and local region construction. Then, we extract three groups of quality-related features: 1) geometry density features; 2) color naturalness features; and 3) angular consistency features. Both the cognitive peculiarities of the human brain and naturalness regularity are incorporated into the designed quality-aware features, which capture the most vital aspects of distorted 3D point clouds. Extensive experiments on several publicly available subjective point cloud quality databases validate that our proposed SGR can compete with state-of-the-art full-reference, reduced-reference, and no-reference quality assessment algorithms.
ACM Transactions on Multimedia Computing Communications and Applications, published 2024-05-11. DOI: https://doi.org/10.1145/3664199
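As a hedged illustration of the regional pre-processing stage, the sketch below derives simple per-point geometry features (a density estimate and a covariance-based planarity score) from k-nearest neighbourhoods; SGR's actual keypoint resampling is driven by normal vectors, which are not computed here, and the point cloud is a random placeholder.

```python
import numpy as np
from scipy.spatial import cKDTree

def local_geometry_features(points, k=16):
    """Per-point local features from k-nearest neighbourhoods of a point cloud.

    Returns a crude density estimate (inverse mean neighbour distance) and a
    planarity-style score from the local covariance eigenvalues.
    """
    tree = cKDTree(points)
    dists, idx = tree.query(points, k=k + 1)           # first neighbour is the point itself
    density = 1.0 / (dists[:, 1:].mean(axis=1) + 1e-8)

    planarity = np.empty(len(points))
    for i, neighbours in enumerate(idx):
        local = points[neighbours[1:]] - points[i]
        eigvals = np.linalg.eigvalsh(np.cov(local.T))   # ascending eigenvalues
        planarity[i] = (eigvals[1] - eigvals[0]) / (eigvals[2] + 1e-8)
    return density, planarity

cloud = np.random.rand(2000, 3)                         # stand-in for a dense 3D point cloud
dens, plan = local_geometry_features(cloud)
print(dens.shape, plan.shape)                            # (2000,) (2000,)
```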
Expanding-Window Zigzag Decodable Fountain Codes for Scalable Multimedia Transmission
Yuli Zhao, Yin Zhang, Francis C. M. Lau, Hai Yu, Zhiliang Zhu, Bin Zhang
In this paper, we present a coding method called the expanding-window zigzag decodable fountain code with unequal error protection property (EWF-ZD UEP code) to achieve scalable multimedia transmission. The key idea of the EWF-ZD UEP code is to utilize bit-shift operations and an expanding-window strategy to improve the decoding performance of the high-priority data without deteriorating the performance of the low-priority data. To provide more protection for the high-priority data, we precode the different importance levels using LDPC codes of varying code rates. The generalized variable nodes of different importance levels are further grouped into several windows. Each window is associated with a selection probability and a bit-shift distribution. The combination of bit-shift and symbol exclusive-or operations is used to generate an encoded symbol. Theoretical and simulation results on input symbols of two importance levels reveal that the proposed EWF-ZD UEP code exhibits the UEP property. With a small bit shift, the decoding delay for recovering high-priority input symbols is decreased without degrading the decoding performance of the low-priority input symbols. Moreover, according to simulation results on scalable video coding, our scheme provides better basic video quality at a lower proportion of received symbols compared to three state-of-the-art UEP fountain codes.
ACM Transactions on Multimedia Computing Communications and Applications, published 2024-05-11. DOI: https://doi.org/10.1145/3664610
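The core encoding primitive of zigzag-decodable codes, combining bit-shifted source symbols with exclusive-or, can be sketched in a few lines; the fixed symbol choices and shifts below stand in for draws from the degree and shift distributions, and the expanding-window selection, LDPC precoding, and the decoder are omitted.

```python
import random

def zd_encode(source_symbols, chosen, shifts):
    """Form one encoded symbol by XOR-ing bit-shifted source symbols.

    source_symbols: list of ints, each an 8-bit input symbol.
    chosen / shifts: indices of the combined symbols and their bit shifts
    (in a fountain code these would be drawn from degree and shift
    distributions; here they are fixed for illustration).
    """
    encoded = 0
    for i, s in zip(chosen, shifts):
        encoded ^= source_symbols[i] << s      # shifted copies "zigzag" across bit positions
    return encoded

random.seed(0)
src = [random.randrange(256) for _ in range(4)]
enc = zd_encode(src, chosen=[0, 2, 3], shifts=[0, 1, 3])
print(f"source={src}  encoded={enc:b}")
```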
SEMScene: Semantic-Consistency Enhanced Multi-Level Scene Graph Matching for Image-Text Retrieval
Yuankun Liu, Xiang Yuan, Haochen Li, Zhijie Tan, Jinsong Huang, Jingjie Xiao, Weiping Li, Tong Mo
Image-text retrieval, a fundamental cross-modal task, performs similarity reasoning between images and texts. The primary challenge for image-text retrieval is cross-modal semantic heterogeneity, where the semantic features of the visual and textual modalities are rich but distinct. A scene graph is an effective representation for images and texts, as it explicitly models objects and their relations. However, existing scene-graph-based methods have not fully taken into consideration the features at various granularities implicit in a scene graph (e.g., triplets); this inadequate feature matching leads to the loss of non-trivial semantic information (e.g., inner relations among triplets). Therefore, we propose a Semantic-Consistency Enhanced Multi-Level Scene Graph Matching (SEMScene) network, which exploits the semantic relevance between visual and textual scene graphs from fine-grained to coarse-grained levels. Firstly, under the scene graph representation, we perform feature matching that includes low-level node matching, mid-level semantic triplet matching, and high-level holistic scene graph matching. Secondly, to enhance the semantic consistency of object-fused triplets carrying key correlation information, we propose a dual-step constraint mechanism in mid-level matching. Thirdly, to guide the model to learn the semantic consistency of matched image-text pairs, we devise effective loss functions for each stage of the dual-step constraint. Comprehensive experiments on the Flickr30K and MS-COCO datasets demonstrate that SEMScene achieves state-of-the-art performance with significant improvements.
ACM Transactions on Multimedia Computing Communications and Applications, published 2024-05-11. DOI: https://doi.org/10.1145/3664816
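A hedged sketch of the low-level node-matching stage only: each textual scene-graph node is matched to its most similar visual node by cosine similarity, and the best matches are averaged into an image-text score. The node features below are random placeholders, and the triplet-level and holistic matching stages are not shown.

```python
import numpy as np

def node_matching_score(visual_nodes, textual_nodes):
    """Low-level node matching between a visual and a textual scene graph."""
    v = visual_nodes / (np.linalg.norm(visual_nodes, axis=1, keepdims=True) + 1e-8)
    t = textual_nodes / (np.linalg.norm(textual_nodes, axis=1, keepdims=True) + 1e-8)
    sim = t @ v.T                              # (num_text_nodes, num_visual_nodes)
    return sim.max(axis=1).mean()              # average best-match similarity

rng = np.random.default_rng(0)
visual = rng.normal(size=(12, 256))            # e.g. detected object features
textual = rng.normal(size=(5, 256))            # e.g. parsed noun-phrase embeddings
print(node_matching_score(visual, textual))
```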
Self-Supervised Monocular Depth Estimation via Binocular Geometric Correlation Learning
Bo Peng, Lin Sun, Jianjun Lei, Bingzheng Liu, Haifeng Shen, Wanqing Li, Qingming Huang
Monocular depth estimation aims to infer a depth map from a single image. Although supervised learning-based methods have achieved remarkable performance, they generally rely on a large amount of labor-intensively annotated data. Self-supervised methods on the other hand do not require any annotation of ground-truth depth and have recently attracted increasing attention. In this work, we propose a self-supervised monocular depth estimation network via binocular geometric correlation learning. Specifically, considering the inter-view geometric correlation, a binocular cue prediction module is presented to generate the auxiliary vision cue for the self-supervised learning of monocular depth estimation. Then, to deal with the occlusion in depth estimation, an occlusion interference attenuated constraint is developed to guide the supervision of the network by inferring the occlusion region and producing paired occlusion masks. Experimental results on two popular benchmark datasets have demonstrated that the proposed network obtains competitive results compared to state-of-the-art self-supervised methods and achieves comparable results to some popular supervised methods.
ACM Transactions on Multimedia Computing Communications and Applications, published 2024-05-08. DOI: https://doi.org/10.1145/3663570
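Self-supervised depth methods of this kind are typically trained with a photometric reconstruction loss between the target view and an image reconstructed from the other view. The sketch below assumes the reconstruction is already given (the depth-based warping is omitted) and uses a crude global structural term in place of windowed SSIM, so it is only a stand-in for the kind of objective involved.

```python
import numpy as np

def photometric_loss(target, reconstructed, alpha=0.85):
    """Self-supervised photometric loss between a target frame and its
    reconstruction from the other view (or an adjacent frame)."""
    l1 = np.abs(target - reconstructed).mean()
    # Simplified structural term from global statistics (stand-in for windowed SSIM).
    mu_t, mu_r = target.mean(), reconstructed.mean()
    var_t, var_r = target.var(), reconstructed.var()
    cov = ((target - mu_t) * (reconstructed - mu_r)).mean()
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    ssim = ((2 * mu_t * mu_r + c1) * (2 * cov + c2)) / ((mu_t ** 2 + mu_r ** 2 + c1) * (var_t + var_r + c2))
    return alpha * (1 - ssim) / 2 + (1 - alpha) * l1

rng = np.random.default_rng(0)
target = rng.random((64, 64))                                        # stand-in for the target frame
recon = np.clip(target + 0.05 * rng.normal(size=(64, 64)), 0, 1)     # pretend warped other view
print(photometric_loss(target, recon))
```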