
Latest Publications: ACM Transactions on Multimedia Computing Communications and Applications

Cascaded Adaptive Graph Representation Learning for Image Copy-Move Forgery Detection
IF 5.1 | CAS Tier 3 | Q1 Computer Science | Pub Date: 2024-05-29 | DOI: 10.1145/3669905
Yuanman Li, Lanhao Ye, Haokun Cao, Wei Wang, Zhongyun Hua

In the realm of image security, there has been a burgeoning interest in harnessing deep learning techniques for the detection of digital image copy-move forgeries, resulting in promising outcomes. The generation process of such forgeries results in a distinctive topological structure among patches, and collaborative modeling based on these underlying topologies proves instrumental in enhancing the discrimination of ambiguous pixels. Despite the attention received, existing deep learning models predominantly rely on convolutional neural networks (CNNs), falling short in adequately capturing correlations among distant patches. This limitation impedes the seamless propagation of information and collaborative learning across related patches. To address this gap, our work introduces an innovative framework for image copy-move forensics rooted in graph representation learning. Initially, we introduce an adaptive graph learning approach to foster collaboration among related patches, dynamically learning the inherent topology of patches. The devised approach excels in promoting efficient information flow among related patches, encompassing both short-range and long-range correlations. Additionally, we formulate a cascaded graph learning framework, progressively refining patch representations and disseminating information to broader correlated patches based on their updated topologies. Finally, we propose a hierarchical cross-attention mechanism facilitating the exchange of information between the cascaded graph learning branch and a dedicated forgery detection branch. This equips our method with the capability to jointly grasp the homology of copy-move correspondences and identify inconsistencies between the target region and the background. Comprehensive experimental results validate the superiority of our proposed scheme, providing a robust solution to security challenges posed by digital image manipulations.
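
The adaptive graph learning described above can be illustrated with a minimal, hypothetical sketch: patch embeddings induce a learned affinity matrix whose row-normalization serves as an adaptive adjacency for feature propagation, and stacking such layers gives the cascaded refinement. All module names, dimensions, and the residual update are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): adaptive graph learning over
# image patches. Patch embeddings induce a learned affinity matrix; a
# softmax row-normalization yields an adaptive adjacency used to propagate
# features between related patches.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveGraphLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)   # projects patches for affinity scoring
        self.key = nn.Linear(dim, dim)
        self.update = nn.Linear(dim, dim)  # feature transform after propagation

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, num_patches, dim)
        q, k = self.query(patches), self.key(patches)
        # Dynamically learned topology: affinity between every patch pair,
        # covering both short-range and long-range correlations.
        affinity = torch.bmm(q, k.transpose(1, 2)) / q.size(-1) ** 0.5
        adjacency = F.softmax(affinity, dim=-1)
        # Graph propagation: each patch aggregates features of related patches.
        propagated = torch.bmm(adjacency, patches)
        return patches + F.relu(self.update(propagated))  # residual update

# A "cascade" is then just stacked layers, each refining the topology.
layers = nn.Sequential(*[AdaptiveGraphLayer(64) for _ in range(3)])
x = torch.randn(2, 196, 64)  # e.g., a 14x14 patch grid with 64-d features
print(layers(x).shape)       # torch.Size([2, 196, 64])
```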

Citations: 0
Skeleton-aware Graph-based Adversarial Networks for Human Pose Estimation from Sparse IMUs
IF 5.1 | CAS Tier 3 | Q1 Computer Science | Pub Date: 2024-05-29 | DOI: 10.1145/3669904
Kaixin Chen, Lin Zhang, Zhong Wang, Shengjie Zhao, Yicong Zhou

Recently, sparse-inertial human pose estimation (SI-HPE) with only a few IMUs has shown great potential in various fields. The most advanced work in this area achieved fairish results using only six IMUs. However, there are still two major issues that remain to be addressed. First, existing methods typically treat SI-HPE as a temporal sequential learning problem and often ignore the important spatial prior of skeletal topology. Second, there are far more synthetic data in their training data than real data, and the data distribution of synthetic data and real data is quite different, which makes it difficult for the model to be applied to more diverse real data. To address these issues, we propose “Graph-based Adversarial Inertial Poser (GAIP)”, which tracks body movements using sparse data from six IMUs. To make full use of the spatial prior, we design a multi-stage pose regressor with graph convolution to explicitly learn the skeletal topology. A joint position loss is also introduced to implicitly mine spatial information. To enhance the generalization ability, we propose supervising the pose regression with an adversarial loss from a discriminator, bringing the ability of adversarial networks to learn implicit constraints into full play. Additionally, we construct a real dataset that includes hip support movements and a synthetic dataset containing various motion categories to enrich the diversity of inertial data for SI-HPE. Extensive experiments demonstrate that GAIP produces results with more precise limb movement amplitudes and relative joint positions, accompanied by smaller joint angle and position errors compared to state-of-the-art counterparts. The datasets and codes are publicly available at https://cslinzhang.github.io/GAIP/.
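
A minimal sketch of the adversarial supervision idea follows, under assumed tensor layouts (6 IMUs with 12 channels each, 24 joints in 3D) and an assumed loss weighting; the actual GAIP regressor uses graph convolutions over the skeletal topology, which this toy MLP omits.

```python
# Minimal sketch (hypothetical shapes, not the GAIP code): a pose regressor
# trained with both a joint-position loss and an adversarial loss from a
# discriminator that judges whether a predicted pose looks plausible.
import torch
import torch.nn as nn

regressor = nn.Sequential(nn.Linear(6 * 12, 256), nn.ReLU(), nn.Linear(256, 24 * 3))
discriminator = nn.Sequential(nn.Linear(24 * 3, 128), nn.ReLU(), nn.Linear(128, 1))
bce = nn.BCEWithLogitsLoss()

imu = torch.randn(32, 6 * 12)       # 6 IMUs, 12 channels each (assumed layout)
gt_pose = torch.randn(32, 24 * 3)   # 24 joints x 3D positions (assumed layout)

pred = regressor(imu)
# Explicit supervision: joint position loss.
pos_loss = nn.functional.mse_loss(pred, gt_pose)
# Implicit supervision: the regressor is rewarded when the discriminator
# cannot distinguish predicted poses from real ones.
adv_loss = bce(discriminator(pred), torch.ones(32, 1))
gen_loss = pos_loss + 0.1 * adv_loss  # the weighting is an assumption

# Discriminator step: real poses labeled 1, predicted poses labeled 0.
d_loss = bce(discriminator(gt_pose), torch.ones(32, 1)) + \
         bce(discriminator(pred.detach()), torch.zeros(32, 1))
```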

Citations: 0
Parents and Children: Distinguishing Multimodal DeepFakes from Natural Images
IF 5.1 | CAS Tier 3 | Q1 Computer Science | Pub Date: 2024-05-21 | DOI: 10.1145/3665497
Roberto Amoroso, Davide Morelli, Marcella Cornia, Lorenzo Baraldi, Alberto Del Bimbo, Rita Cucchiara

Recent advancements in diffusion models have enabled the generation of realistic deepfakes from textual prompts in natural language. While these models have numerous benefits across various sectors, they have also raised concerns about the potential misuse of fake images and put new pressure on fake image detection. In this work, we pioneer a systematic study on the detection of deepfakes generated by state-of-the-art diffusion models. Firstly, we conduct a comprehensive analysis of the performance of contrastive and classification-based visual features, respectively extracted from CLIP-based models and from ResNet- or ViT-based architectures trained on image classification datasets. Our results demonstrate that fake images share common low-level cues, which render them easily recognizable. Further, we devise a multimodal setting wherein fake images are synthesized by different textual captions, which are used as seeds for a generator. Under this setting, we quantify the performance of fake detection strategies and introduce a contrastive-based disentangling method that lets us analyze the role of the semantics of textual descriptions and of low-level perceptual cues. Finally, we release a new dataset, called COCOFake, containing about 1.2M images generated from the original COCO image-caption pairs using two recent text-to-image diffusion models, namely Stable Diffusion v1.4 and v2.0.
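
The paper's analysis of classification-based features can be approximated with a standard linear-probe setup: freeze an ImageNet-pretrained backbone and train only a real/fake logit on its features. The sketch below uses a torchvision ResNet-50 for illustration; the random tensors stand in for a real dataloader batch.

```python
# Minimal sketch (not the paper's pipeline): probing frozen classification
# features for real/fake detection, in the spirit of the paper's analysis of
# ResNet-based features.
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

backbone = resnet50(weights=ResNet50_Weights.DEFAULT)  # downloads on first use
backbone.fc = nn.Identity()          # drop the ImageNet classifier head
backbone.eval()                      # features stay frozen
for p in backbone.parameters():
    p.requires_grad = False

probe = nn.Linear(2048, 1)           # linear probe: real vs. fake logit
criterion = nn.BCEWithLogitsLoss()

images = torch.randn(8, 3, 224, 224)            # stand-in for a dataloader batch
labels = torch.randint(0, 2, (8, 1)).float()    # 1 = generated, 0 = natural
with torch.no_grad():
    feats = backbone(images)         # (8, 2048) frozen visual features
loss = criterion(probe(feats), labels)
loss.backward()                      # only the probe receives gradients
```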

Citations: 0
Spatiotemporal Inconsistency Learning and Interactive Fusion for Deepfake Video Detection
IF 5.1 | CAS Tier 3 | Q1 Computer Science | Pub Date: 2024-05-13 | DOI: 10.1145/3664654
Dengyong Zhang, Wenjie Zhu, Xin Liao, Feifan Qi, Gaobo Yang, Xiangling Ding

The rise of the metaverse has become closely intertwined with the rapid advancement of Deepfake technology. Within the metaverse, individuals exist in digital form and engage in interactions, transactions, and communications through virtual avatars. However, the development of Deepfake technology has led to the proliferation of forged information disseminated under the guise of users’ virtual identities, posing significant security risks to the metaverse. Hence, there is an urgent need to research and develop more robust methods for detecting deep forgeries to address these challenges. This paper explores deepfake video detection by leveraging the spatiotemporal inconsistencies introduced by deepfake generation techniques, proposing the spatiotemporal inconsistency learning and interactive fusion (ST-ILIF) detection method, which consists of a phase-aware stream and a sequence stream. The spatial inconsistencies exhibited in frames of deepfake videos are primarily attributed to variations in the structural information contained within the phase component of the Fourier domain. To mitigate the issue of overfitting the content information, a phase-aware stream is introduced to learn the spatial inconsistencies from phase-based reconstructed frames. Additionally, considering that deepfake videos are generated frame by frame and lack temporal consistency between frames, a sequence stream is proposed to extract temporal-inconsistency features from the spatiotemporal difference information between consecutive frames. Finally, through feature interaction and fusion of the two streams, the representation ability of the intermediate and classification features is further enhanced. The proposed method, evaluated on four mainstream datasets, outperformed most existing methods, and extensive experimental results demonstrated its effectiveness in identifying deepfake videos. Our source code is available at https://github.com/qff98/Deepfake-Video-Detection
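
The phase-based frame reconstruction that feeds the phase-aware stream can be sketched directly from the abstract's description: keep the Fourier phase, discard the magnitude, and invert the transform. The normalization step at the end is a practical assumption, not a detail from the paper.

```python
# Minimal sketch (not the ST-ILIF code): phase-only frame reconstruction.
# Keeping the Fourier phase while flattening the magnitude to 1 preserves
# structural information and suppresses content/texture, which is the idea
# behind learning spatial inconsistencies from phase-reconstructed frames.
import numpy as np

def phase_reconstruct(frame: np.ndarray) -> np.ndarray:
    """frame: 2D grayscale array; returns its phase-only reconstruction."""
    spectrum = np.fft.fft2(frame)
    phase = np.angle(spectrum)                 # structure lives mostly here
    unit_spectrum = np.exp(1j * phase)         # unit magnitude, original phase
    recon = np.real(np.fft.ifft2(unit_spectrum))
    # Normalize to [0, 1] for downstream networks (a practical assumption).
    recon -= recon.min()
    return recon / (recon.max() + 1e-8)

frame = np.random.rand(256, 256)       # stand-in for a decoded video frame
print(phase_reconstruct(frame).shape)  # (256, 256)
```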

Citations: 0
From CNNs to Transformers in Multimodal Human Action Recognition: A Survey
IF 5.1 | CAS Tier 3 | Q1 Computer Science | Pub Date: 2024-05-13 | DOI: 10.1145/3664815
Muhammad Bilal Shaikh, Douglas Chai, Syed Muhammad Shamsul Islam, Naveed Akhtar

Due to its widespread applications, human action recognition is one of the most widely studied research problems in Computer Vision. Recent studies have shown that addressing it using multimodal data leads to superior performance compared to relying on a single data modality. During the adoption of deep learning for visual modelling in the last decade, action recognition approaches have mainly relied on Convolutional Neural Networks (CNNs). However, the recent rise of Transformers in visual modelling is now causing a paradigm shift for the action recognition task as well. This survey captures this transition while focusing on Multimodal Human Action Recognition (MHAR). Unique to the induction of multimodal computational models is the process of ‘fusing’ the features of the individual data modalities. Hence, we specifically focus on the fusion design aspects of MHAR approaches. We analyze the classic and emerging techniques in this regard, while also highlighting the popular trends in the adaptation of CNN and Transformer building blocks for the overall problem. In particular, we emphasize recent design choices that have led to more efficient MHAR models. Unlike existing reviews, which discuss Human Action Recognition from a broad perspective, this survey is specifically aimed at pushing the boundaries of MHAR research by identifying promising architectural and fusion design choices for training practicable models. We also provide an outlook on the multimodal datasets from the viewpoint of their scale and evaluation. Finally, building on the reviewed literature, we discuss the challenges and future avenues for MHAR.
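
As a concrete illustration of the Transformer-style fusion designs such a survey examines, the sketch below shows cross-attention in which one modality's tokens query another's. The modality pair, token counts, and dimensions are illustrative assumptions, not any specific surveyed architecture.

```python
# Minimal sketch of Transformer-style multimodal fusion: cross-attention
# lets RGB tokens attend to skeleton tokens (query = RGB, key/value =
# skeleton), followed by a residual connection and layer norm.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, rgb_tokens, skeleton_tokens):
        # Each RGB token aggregates the skeleton evidence most relevant to it.
        fused, _ = self.attn(rgb_tokens, skeleton_tokens, skeleton_tokens)
        return self.norm(rgb_tokens + fused)   # residual + norm, Transformer-style

rgb = torch.randn(2, 49, 256)    # e.g., a CNN feature map flattened to tokens
skel = torch.randn(2, 25, 256)   # e.g., 25 skeleton joints embedded to 256-d
print(CrossModalFusion()(rgb, skel).shape)  # torch.Size([2, 49, 256])
```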

Citations: 0
HKA: A Hierarchical Knowledge Alignment Framework for Multimodal Knowledge Graph Completion
IF 5.1 | CAS Tier 3 | Q1 Computer Science | Pub Date: 2024-05-11 | DOI: 10.1145/3664288
Yunhui Xu, Youru Li, Muhao Xu, Zhenfeng Zhu, Yao Zhao

Recent years have witnessed the successful application of knowledge graph techniques in structured data processing, but how to incorporate knowledge from visual and textual modalities into knowledge graphs has received less attention. To better organize such knowledge, Multimodal Knowledge Graphs (MKGs), comprising the structural triplets of traditional Knowledge Graphs (KGs) together with entity-related multimodal data (e.g., images and texts), have been introduced successively. However, it is still a great challenge to explore MKGs due to their inherent incompleteness. Although most existing Multimodal Knowledge Graph Completion (MKGC) approaches can infer missing triplets based on available factual triplets and multimodal information, they largely ignore modal conflicts and the supervisory effect, failing to achieve a more comprehensive understanding of entities. To address these issues, we propose a novel Hierarchical Knowledge Alignment (HKA) framework for MKGC. Specifically, a macro-knowledge alignment module is proposed to capture global semantic relevance between modalities to deal with modal conflicts in an MKG. Furthermore, a micro-knowledge alignment module is developed to reveal local consistency information more effectively through inter- and intra-modality supervisory effects. By integrating the different modal predictions, a final decision can be made. Experimental results on three benchmark MKGC tasks have demonstrated the effectiveness of the proposed HKA framework.
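
One plausible reading of the macro-knowledge alignment module is a symmetric contrastive objective that pulls together the visual and textual embeddings of the same entity; the sketch below implements that reading and is an assumption, not the paper's exact formulation.

```python
# Minimal sketch (not the HKA code): a symmetric InfoNCE-style loss that
# aligns an entity's visual and textual embeddings, one way to capture
# global semantic relevance between modalities.
import torch
import torch.nn.functional as F

def alignment_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor, tau: float = 0.07):
    # img_emb, txt_emb: (num_entities, dim); row i describes the same entity.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / tau        # pairwise cross-modal similarity
    targets = torch.arange(img_emb.size(0))     # the diagonal pairs match
    # Symmetric cross-entropy: image-to-text and text-to-image directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = alignment_loss(torch.randn(16, 128), torch.randn(16, 128))
```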

Citations: 0
Blind Quality Assessment of Dense 3D Point Clouds with Structure Guided Resampling
IF 5.1 | CAS Tier 3 | Q1 Computer Science | Pub Date: 2024-05-11 | DOI: 10.1145/3664199
Wei Zhou, Qi Yang, Wu Chen, Qiuping Jiang, Guangtao Zhai, Weisi Lin

Objective quality assessment of 3D point clouds is essential for the development of immersive multimedia systems in real-world applications. Despite the success of perceptual quality evaluation for 2D images and videos, blind/no-reference metrics are still scarce for 3D point clouds with large-scale irregularly distributed 3D points. Therefore, in this paper, we propose an objective point cloud quality index with Structure Guided Resampling (SGR) to automatically evaluate the perceptually visual quality of dense 3D point clouds. The proposed SGR is a general-purpose blind quality assessment method without the assistance of any reference information. Specifically, considering that the human visual system (HVS) is highly sensitive to structure information, we first exploit the unique normal vectors of point clouds to execute regional pre-processing which consists of keypoint resampling and local region construction. Then, we extract three groups of quality-related features, including: 1) geometry density features; 2) color naturalness features; 3) angular consistency features. Both the cognitive peculiarities of the human brain and naturalness regularity are involved in the designed quality-aware features that can capture the most vital aspects of distorted 3D point clouds. Extensive experiments on several publicly available subjective point cloud quality databases validate that our proposed SGR can compete with state-of-the-art full-reference, reduced-reference, and no-reference quality assessment algorithms.
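
A minimal sketch of what a geometry density feature could look like follows, assuming a k-nearest-neighbor summary of local point spacing; the actual SGR pipeline additionally performs normal-guided keypoint resampling and local region construction, which are omitted here.

```python
# Minimal sketch (not the SGR implementation): a no-reference geometry
# density feature from k-nearest-neighbor distances. Distortions such as
# downsampling or noise change the local spacing of points, which summary
# statistics of kNN distances can capture.
import numpy as np
from scipy.spatial import cKDTree

def geometry_density_features(points: np.ndarray, k: int = 10) -> np.ndarray:
    """points: (N, 3) array; returns [mean, std] of local point spacing."""
    tree = cKDTree(points)
    # Query k + 1 neighbors because each point's nearest neighbor is itself.
    dists, _ = tree.query(points, k=k + 1)
    local_spacing = dists[:, 1:].mean(axis=1)   # per-point mean neighbor distance
    return np.array([local_spacing.mean(), local_spacing.std()])

cloud = np.random.rand(5000, 3)  # stand-in for a dense point cloud
print(geometry_density_features(cloud))
```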

Citations: 0
Expanding-Window Zigzag Decodable Fountain Codes for Scalable Multimedia Transmission
IF 5.1 | CAS Tier 3 | Q1 Computer Science | Pub Date: 2024-05-11 | DOI: 10.1145/3664610
Yuli Zhao, Yin Zhang, Francis C. M. Lau, Hai Yu, Zhiliang Zhu, Bin Zhang

In this paper, we present a coding method called the expanding-window zigzag decodable fountain code with unequal error protection property (EWF-ZD UEP code) to achieve scalable multimedia transmission. The key idea of the EWF-ZD UEP code is to utilize bit-shift operations and an expanding-window strategy to improve the decoding performance of the high-priority data without degrading that of the low-priority data. To provide more protection for the high-priority data, we precode the different importance levels using LDPC codes of varying code rates. The generalized variable nodes of different importance levels are further grouped into several windows. Each window is associated with a selection probability and a bit-shift distribution. The combination of bit-shift and symbol exclusive-or operations is used to generate an encoded symbol. Theoretical and simulation results on input symbols of two importance levels reveal that the proposed EWF-ZD UEP code exhibits the UEP property. With a small bit shift, the decoding delay for recovering high-priority input symbols is decreased without degrading the decoding performance of the low-priority input symbols. Moreover, according to simulation results on scalable video coding, our scheme provides better basic video quality at a lower proportion of received symbols compared to three state-of-the-art UEP fountain codes.
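
The bit-shift-plus-XOR encoding and the expanding-window selection can be sketched as below, with symbols modeled as Python integers acting as bit strings. The degree choice, window sizes, and probabilities are illustrative stand-ins for the code's actual degree and bit-shift distributions.

```python
# Minimal sketch (not the EWF-ZD UEP codec): one encoded symbol produced by
# bit-shifting selected source symbols and XOR-ing them, with the window
# chosen according to a selection probability.
import random

def encode_symbol(source: list[int], window_probs: list[float],
                  window_sizes: list[int], max_shift: int = 2) -> int:
    # Expanding windows: window i spans the first window_sizes[i] symbols,
    # so high-priority symbols appear in every window and get more protection.
    window = random.choices(range(len(window_sizes)), weights=window_probs)[0]
    pool = source[:window_sizes[window]]
    degree = random.randint(1, len(pool))      # stand-in for a degree distribution
    encoded = 0
    for sym in random.sample(pool, degree):
        shift = random.randint(0, max_shift)   # small shifts keep decoding delay low
        encoded ^= sym << shift                # zigzag-decodable: shifted XOR
    return encoded

src = [random.getrandbits(32) for _ in range(8)]  # 2 high- + 6 low-priority symbols
print(encode_symbol(src, window_probs=[0.3, 0.7], window_sizes=[2, 8]))
```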

Citations: 0
SEMScene: Semantic-Consistency Enhanced Multi-Level Scene Graph Matching for Image-Text Retrieval
IF 5.1 | CAS Tier 3 | Q1 Computer Science | Pub Date: 2024-05-11 | DOI: 10.1145/3664816
Yuankun Liu, Xiang Yuan, Haochen Li, Zhijie Tan, Jinsong Huang, Jingjie Xiao, Weiping Li, Tong Mo

Image-text retrieval, a fundamental cross-modal task, performs similarity reasoning for images and texts. The primary challenge for image-text retrieval is cross-modal semantic heterogeneity: the semantic features of the visual and textual modalities are rich but distinct. The scene graph is an effective representation for images and texts, as it explicitly models objects and their relations. Existing scene-graph-based methods have not fully taken into consideration the features at the various granularities implicit in a scene graph (e.g., triplets); this inadequate feature matching leads to the loss of non-trivial semantic information (e.g., inner relations among triplets). Therefore, we propose a Semantic-Consistency Enhanced Multi-Level Scene Graph Matching (SEMScene) network, which exploits the semantic relevance between visual and textual scene graphs from fine-grained to coarse-grained levels. Firstly, under the scene graph representation, we perform feature matching comprising low-level node matching, mid-level semantic triplet matching, and high-level holistic scene graph matching. Secondly, to enhance the semantic consistency for object-fused triplets carrying key correlation information, we propose a dual-step constraint mechanism in the mid-level matching. Thirdly, to guide the model to learn the semantic consistency of matched image-text pairs, we devise effective loss functions for each stage of the dual-step constraint. Comprehensive experiments on the Flickr30K and MS-COCO datasets demonstrate that SEMScene achieves state-of-the-art performance with significant improvements.
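
The low-level node-matching stage can be sketched as a cosine-similarity matrix between visual object embeddings and textual phrase embeddings, aggregated by best-match pooling; the aggregation rule and shapes are assumptions, and the mid- and high-level stages would score triplet and whole-graph embeddings analogously.

```python
# Minimal sketch (not the SEMScene model): the low-level node-matching stage,
# scoring an image-text pair by how well each textual node is covered by its
# best-matching visual node.
import torch
import torch.nn.functional as F

def node_matching_score(visual_nodes: torch.Tensor, text_nodes: torch.Tensor):
    # visual_nodes: (Nv, dim) object embeddings from the visual scene graph;
    # text_nodes:   (Nt, dim) phrase embeddings from the textual scene graph.
    v = F.normalize(visual_nodes, dim=-1)
    t = F.normalize(text_nodes, dim=-1)
    sim = t @ v.t()                       # (Nt, Nv) cosine similarities
    # Each textual node keeps its best visual match; average over nodes.
    return sim.max(dim=1).values.mean()

score = node_matching_score(torch.randn(12, 256), torch.randn(7, 256))
# Triplet- and graph-level scores would be computed the same way on triplet
# embeddings and pooled graph embeddings, then combined into one similarity.
```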

Citations: 0
Self-Supervised Monocular Depth Estimation via Binocular Geometric Correlation Learning
IF 5.1 | CAS Tier 3 | Q1 Computer Science | Pub Date: 2024-05-08 | DOI: 10.1145/3663570
Bo Peng, Lin Sun, Jianjun Lei, Bingzheng Liu, Haifeng Shen, Wanqing Li, Qingming Huang

Monocular depth estimation aims to infer a depth map from a single image. Although supervised learning-based methods have achieved remarkable performance, they generally rely on a large amount of labor-intensively annotated data. Self-supervised methods on the other hand do not require any annotation of ground-truth depth and have recently attracted increasing attention. In this work, we propose a self-supervised monocular depth estimation network via binocular geometric correlation learning. Specifically, considering the inter-view geometric correlation, a binocular cue prediction module is presented to generate the auxiliary vision cue for the self-supervised learning of monocular depth estimation. Then, to deal with the occlusion in depth estimation, an occlusion interference attenuated constraint is developed to guide the supervision of the network by inferring the occlusion region and producing paired occlusion masks. Experimental results on two popular benchmark datasets have demonstrated that the proposed network obtains competitive results compared to state-of-the-art self-supervised methods and achieves comparable results to some popular supervised methods.
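
The binocular self-supervision signal can be sketched as disparity-based view synthesis: warp the right image toward the left view with the predicted disparity and penalize the photometric error. The normalized-disparity convention and shapes below are assumptions, not the paper's exact setup.

```python
# Minimal sketch (not the paper's network): photometric self-supervision via
# disparity-based warping. A predicted disparity resamples the right image
# into the left view; the reconstruction error supervises depth without any
# ground-truth labels.
import torch
import torch.nn.functional as F

def warp_right_to_left(right: torch.Tensor, disparity: torch.Tensor):
    # right: (B, 3, H, W); disparity: (B, 1, H, W), in normalized [-1, 1] units.
    b, _, h, w = right.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w),
                            indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1).clone()
    grid[..., 0] -= disparity.squeeze(1)   # shift sampling along x by disparity
    return F.grid_sample(right, grid, align_corners=True)

left = torch.rand(2, 3, 64, 128)
right = torch.rand(2, 3, 64, 128)
disp = torch.rand(2, 1, 64, 128) * 0.1     # stand-in for a network's output
photometric_loss = (warp_right_to_left(right, disp) - left).abs().mean()
```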

Citations: 0