This paper introduces SmoothFlowNet3D, an innovative encoder-decoder architecture specifically designed for bridging the domain gap in scene flow estimation. To achieve this goal, SmoothFlowNet3D divides the scene flow estimation task into two stages: initial scene flow estimation and smoothness refinement. Specifically, SmoothFlowNet3D comprises a hierarchical encoder that extracts multi-scale point cloud features from two consecutive frames, along with a hierarchical decoder responsible for predicting the initial scene flow and further refining it to achieve smoother estimation. To generate the initial scene flow, a cross-frame nearest neighbor search operation is performed between the features extracted from two consecutive frames, resulting in forward and backward flow embeddings. These embeddings are then combined to form the bidirectional flow embedding, serving as input for predicting the initial scene flow. Additionally, a flow smoothing module based on the self-attention mechanism is proposed to predict the smoothing error and facilitate the refinement of the initial scene flow for more accurate and smoother estimation results. Extensive experiments demonstrate that the proposed SmoothFlowNet3D approach achieves state-of-the-art performance on both synthetic datasets and real LiDAR point clouds, confirming its effectiveness in enhancing scene flow smoothness.
{"title":"Bridging the Domain Gap in Scene Flow Estimation via Hierarchical Smoothness Refinement","authors":"Dejun Zhang, Mian Zhang, Xuefeng Tan, Jun Liu","doi":"10.1145/3661823","DOIUrl":"https://doi.org/10.1145/3661823","url":null,"abstract":"<p>This paper introduces SmoothFlowNet3D, an innovative encoder-decoder architecture specifically designed for bridging the domain gap in scene flow estimation. To achieve this goal, SmoothFlowNet3D divides the scene flow estimation task into two stages: initial scene flow estimation and smoothness refinement. Specifically, SmoothFlowNet3D comprises a hierarchical encoder that extracts multi-scale point cloud features from two consecutive frames, along with a hierarchical decoder responsible for predicting the initial scene flow and further refining it to achieve smoother estimation. To generate the initial scene flow, a cross-frame nearest neighbor search operation is performed between the features extracted from two consecutive frames, resulting in forward and backward flow embeddings. These embeddings are then combined to form the bidirectional flow embedding, serving as input for predicting the initial scene flow. Additionally, a flow smoothing module based on the self-attention mechanism is proposed to predict the smoothing error and facilitate the refinement of the initial scene flow for more accurate and smoother estimation results. Extensive experiments demonstrate that the proposed SmoothFlowNet3D approach achieves state-of-the-art performance on both synthetic datasets and real LiDAR point clouds, confirming its effectiveness in enhancing scene flow smoothness.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"1 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140800729","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yuanshen Guan, Ruikang Xu, Mingde Yao, Jie Huang, Zhiwei Xiong
Synthesizing a high dynamic range (HDR) image from multi-exposure images has recently been studied extensively using convolutional neural networks (CNNs). Despite the remarkable progress, existing CNN-based methods have the intrinsic limitation of a local receptive field, which hinders the model’s capability to capture long-range correspondences and large motions across under/over-exposed images, resulting in ghosting artifacts in dynamic scenes. To address this challenge, we propose a novel Edge-guided Transformer framework (EdiTor) customized for ghost-free HDR reconstruction, where the long-range motions across different exposures are delicately modeled by incorporating an edge prior. Specifically, EdiTor calculates patch-wise correlation maps on both the image and edge domains, enabling the network to effectively model global movements and fine-grained shifts across multiple exposures. Based on this framework, we further propose an exposure-masked loss to adaptively compensate for severely distorted regions (e.g., highlights and shadows). Experiments demonstrate that EdiTor outperforms state-of-the-art methods both quantitatively and qualitatively, achieving appealing HDR visualization with unified textures and colors.
{"title":"EdiTor: Edge-guided Transformer for Ghost-free High Dynamic Range Imaging","authors":"Yuanshen Guan, Ruikang Xu, Mingde Yao, Jie Huang, Zhiwei Xiong","doi":"10.1145/3657293","DOIUrl":"https://doi.org/10.1145/3657293","url":null,"abstract":"<p>Synthesizing the high dynamic range (HDR) image from multi-exposure images has been extensively studied by exploiting convolutional neural networks (CNNs) recently. Despite the remarkable progress, existing CNN-based methods have the intrinsic limitation of local receptive field, which hinders the model’s capability of capturing long-range correspondence and large motions across under/over-exposure images, resulting in ghosting artifacts of dynamic scenes. To address the above challenge, we propose a novel <b>Ed</b>ge-gu<b>i</b>ded <b>T</b>ransf<b>or</b>mer framework (EdiTor) customized for ghost-free HDR reconstruction, where the long-range motions across different exposures can be delicately modeled by incorporating the edge prior. Specifically, EdiTor calculates patch-wise correlation maps on both image and edge domains, enabling the network to effectively model the global movements and the fine-grained shifts across multiple exposures. Based on this framework, we further propose an exposure-masked loss to adaptively compensate for the severely distorted regions (<i>e.g.</i>, highlights and shadows). Experiments demonstrate that EdiTor outperforms state-of-the-art methods both quantitatively and qualitatively, achieving appealing HDR visualization with unified textures and colors.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"5 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140800732","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jiayu Yang, Chunhui Yang, Fei Xiong, Yongqi Zhai, Ronggang Wang
Learned video compression has drawn great attention and shown promising compression performance recently. In this paper, we focus on two components of the learned video compression framework, i.e., the conditional entropy model and the quality enhancement module, to improve compression performance. Specifically, we propose an adaptive spatial-temporal entropy model for image, motion and residual compression, which introduces a temporal prior to reduce the temporal redundancy of latents and an additional modulated mask to evaluate similarity and perform refinement. In addition, a quality enhancement module is proposed for the predicted frame and the reconstructed frame to improve frame quality and reduce the bitrate cost of residual coding. The module reuses the decoded optical flow as a motion prior and utilizes deformable convolution to mine high-quality information from the reference frame in a bit-free manner. The two proposed coding tools are integrated into a pixel-domain residual-coding based compression framework to evaluate their effectiveness. Experimental results demonstrate that our framework achieves competitive compression performance in the low-delay scenario compared with recent learning-based methods and traditional H.265/HEVC in terms of PSNR and MS-SSIM. The code is available at OpenLVC.
{"title":"Learned Video Compression with Adaptive Temporal Prior and Decoded Motion-aided Quality Enhancement","authors":"Jiayu Yang, Chunhui Yang, Fei Xiong, Yongqi Zhai, Ronggang Wang","doi":"10.1145/3661824","DOIUrl":"https://doi.org/10.1145/3661824","url":null,"abstract":"<p>Learned video compression has drawn great attention and shown promising compression performance recently. In this paper, we focus on the two components in learned video compression framework, i.e., conditional entropy model and quality enhancement module, to improve compression performance. Specifically, we propose an adaptive spatial-temporal entropy model for image, motion and residual compression, which introduces temporal prior to reduce temporal redundancy of latents and an additional modulated mask to evaluate the similarity and perform refinement. Besides, a quality enhancement module is proposed for predicted frame and reconstructed frame to improve frame quality and reduce bitrate cost of residual coding. The module reuses decoded optical flow as motion prior and utilizes deformable convolution to mine high-quality information from reference frame in a bit-free manner. The two proposed coding tools are integrated into a pixel-domain residual-coding based compression framework to evaluate their effectiveness. Experimental results demonstrate that our framework achieves competitive compression performance in low-delay scenario, compared with recent learning-based methods and traditional H.265/HEVC in terms of PSNR and MS-SSIM. The code is available at OpenLVC.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"50 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140800801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hyper-realistic avatars in the metaverse have already raised security concerns about deepfake techniques: deepfakes involving generated video “recordings” may be mistaken for real recordings of the people they depict. As a result, deepfake detection has drawn considerable attention in the multimedia forensics community. Though existing methods for deepfake detection achieve fairly good performance under the intra-dataset scenario, many of them yield unsatisfactory results in the more practically relevant case of cross-dataset testing, where the forged faces in the training and testing datasets come from different domains. To tackle this issue, in this paper, we propose a novel Domain-Invariant and Patch-Discriminative feature learning framework - DI&PD. For image-level feature learning, a single-side adversarial domain generalization is introduced to eliminate domain variances and learn domain-invariant features in training samples from different manipulation methods, along with a global and local random crop augmentation strategy to generate more data views of forged images at various scales. A graph structure is then built by splitting the learned image-level feature maps, with each spatial location corresponding to a local patch, which facilitates patch representation learning by message-passing among similar nodes. Two types of center losses are utilized to learn more discriminative features in both the image-level and patch-level embedding spaces. Extensive experimental results on several datasets demonstrate the effectiveness and generalization of the proposed method compared with other state-of-the-art methods.
{"title":"Domain-invariant and Patch-discriminative Feature Learning for General Deepfake Detection","authors":"Jian Zhang, Jiangqun Ni, Fan Nie, jiwu Huang","doi":"10.1145/3657297","DOIUrl":"https://doi.org/10.1145/3657297","url":null,"abstract":"<p>Hyper-realistic avatars in the metaverse have already raised security concerns about deepfake techniques, deepfakes involving generated video “recording” may be mistaken for a real recording of the people it depicts. As a result, deepfake detection has drawn considerable attention in the multimedia forensic community. Though existing methods for deepfake detection achieve fairly good performance under the intra-dataset scenario, many of them gain unsatisfying results in the case of cross-dataset testing with more practical value, where the forged faces in training and testing datasets are from different domains. To tackle this issue, in this paper, we propose a novel Domain-Invariant and Patch-Discriminative feature learning framework - DI&PD. For image-level feature learning, a single-side adversarial domain generalization is introduced to eliminate domain variances and learn domain-invariant features in training samples from different manipulation methods, along with the global and local random crop augmentation strategy to generate more data views of forged images at various scales. A graph structure is then built by splitting the learned image-level feature maps, with each spatial location corresponding to a local patch, which facilitates patch representation learning by message-passing among similar nodes. Two types of center losses are utilized to learn more discriminative features in both image-level and patch-level embedding spaces. Extensive experimental results on several datasets demonstrate the effectiveness and generalization of the proposed method compared with other state-of-the-art methods.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"100 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140800800","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ning Chen, Zhipeng Cheng, Xuwei Fan, Zhang Liu, Bangzhen Huang, Yifeng Zhao, Lianfen Huang, Xiaojiang Du, Mohsen Guizani
Federated learning (FL) is a prominent paradigm of 6G edge intelligence (EI), which mitigates the privacy breaches and high communication pressure caused by conventional centralized model training in the artificial intelligence of things (AIoT). The execution of multimodal federated perception (MFP) services comprises three sub-processes, namely sensing-based multimodal data generation, communication-based model transmission, and computing-based model training, which ultimately compete for available underlying multi-domain physical resources such as time, frequency, and computing power. Reasonably coordinating multi-domain resource scheduling among sensing, communication, and computing is therefore vital to MFP networks. To address these issues, this paper explores service-oriented resource management with integrated sensing, communication, and computing (ISCC). Specifically, employing the incentive mechanism of the MFP service market, the resource management problem is defined as a social welfare maximization problem, where the concepts of “expanding resources” and “reducing costs” are used to enhance the learning performance gain and reduce resource costs. Experimental results demonstrate the effectiveness and robustness of the proposed resource scheduling mechanisms.
{"title":"Integrated Sensing, Communication, and Computing for Cost-effective Multimodal Federated Perception","authors":"Ning Chen, Zhipeng Cheng, Xuwei Fan, Zhang Liu, Bangzhen Huang, Yifeng Zhao, Lianfen Huang, Xiaojiang Du, Mohsen Guizani","doi":"10.1145/3661313","DOIUrl":"https://doi.org/10.1145/3661313","url":null,"abstract":"<p>Federated learning (FL) is a prominent paradigm of 6G edge intelligence (EI), which mitigates privacy breaches and high communication pressure caused by conventional centralized model training in the artificial intelligence of things (AIoT). The execution of multimodal federated perception (MFP) services comprises three sub-processes, including sensing-based multimodal data generation, communication-based model transmission, and computing-based model training, ultimately competitive on available underlying multi-domain physical resources such as time, frequency, and computing power. How to reasonably coordinate the multi-domain resources scheduling among sensing, communication, and computing, therefore, is vital to the MFP networks. To address the above issues, this paper explores service-oriented resource management with integrated sensing, communication, and computing (ISCC). Specifically, employing the incentive mechanism of the MFP service market, the resources management problem is defined as a social welfare maximization problem, where the concept of “expanding resources” and “reducing costs” is used to enhance learning performance gain and reduce resource costs. Experimental results demonstrate the effectiveness and robustness of the proposed resource scheduling mechanisms.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"64 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140800799","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Although deep learning techniques have achieved significant improvements in image compression, their advantages have not been fully explored in video compression, so the performance of deep-learning-based video compression (DLVC) remains clearly inferior to that of the hybrid video coding framework. In this paper, we propose a novel network to improve the performance of DLVC through its most important modules, including Motion Process (MP), Residual Compression (RC) and Frame Reconstruction (FR). In MP, we design a split second-order attention and multi-scale feature extraction module to fully remove warping artifacts from the multi-scale feature space and the pixel space, which helps reduce distortion in the subsequent processing. In RC, we propose a channel selection mechanism to gradually drop redundant information while preserving informative channels for better rate-distortion performance. Finally, in FR, we introduce a residual multi-scale recurrent network to improve the quality of the current reconstructed frame by progressively exploiting temporal context between it and its several previously reconstructed frames. Extensive experiments are conducted on three widely used video compression datasets (HEVC, UVG and MCL-JCV), and the results demonstrate the superiority of our proposed approach over state-of-the-art methods.
{"title":"High Efficiency Deep-learning Based Video Compression","authors":"Lv Tang, Xinfeng Zhang","doi":"10.1145/3661311","DOIUrl":"https://doi.org/10.1145/3661311","url":null,"abstract":"<p>Although deep learning technique has achieved significant improvement on image compression, but its advantages are not fully explored in video compression, which leads to the performance of deep-learning based video compression (DLVC) is obvious inferior to that of hybrid video coding framework. In this paper, we proposed a novel network to improve the performance of DLVC from its most important modules, including <i>Motion Process</i> (MP), <i>Residual Compression</i> (RC) and <i>Frame Reconstruction</i> (FR). In MP, we design a split second-order attention and multi-scale feature extraction module to fully remove the warping artifacts from multi-scale feature space and pixel space, which can help reduce the distortion in the following process. In RC, we propose a channel selection mechanism to gradually drop redundant information while preserving informative channels for a better rate-distortion performance. Finally, in FR, we introduce a residual multi-scale recurrent network to improve the quality of the current reconstructed frame by progressively exploiting temporal context information between it and its several previous reconstructed frames. Extensive experiments are conducted on the three widely used video compression datasets (HEVC, UVG and MCL-JVC), and the performance demonstrates the superiority of our proposed approach over the state-of-the-art methods.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"99 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140636682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xiaoling Gu, Junkai Zhu, Yongkang Wong, Zizhao Wu, Jun Yu, Jianping Fan, Mohan S. Kankanhalli
Image-based virtual try-on aims at transferring a target in-shop garment onto a reference person, which has garnered significant attention from the research communities recently. However, previous methods have faced severe challenges in handling occlusion problems. To address this limitation, we classify occlusion problems into three types based on the reference person’s arm postures: single-arm occlusion, two-arm non-crossed occlusion, and two-arm crossed occlusion. Specifically, we propose a novel Occlusion-Free Virtual Try-On Network (OF-VTON) that effectively overcomes these occlusion challenges. The OF-VTON framework consists of two core components: i) a new Recurrent Appearance Flow based Deformation (RAFD) model that robustly aligns the in-shop garment to the reference person by adopting a multi-task learning strategy. This model jointly produces the dense appearance flow to warp the garment and predicts a human segmentation map to provide semantic guidance for the subsequent image synthesis model. ii) a powerful Multi-mask Image SynthesiS (MISS) model that generates photo-realistic try-on results by introducing a new mask generation and selection mechanism. Experimental results demonstrate that our proposed OF-VTON significantly outperforms existing state-of-the-art methods by mitigating the impact of occlusion problems. Our code is available at https://github.com/gxl-groups/OF-VTON.
{"title":"Recurrent Appearance Flow for Occlusion-Free Virtual Try-On","authors":"Xiaoling Gu, Junkai Zhu, Yongkang Wong, Zizhao Wu, Jun Yu, Jianping Fan, Mohan S. Kankanhalli","doi":"10.1145/3659581","DOIUrl":"https://doi.org/10.1145/3659581","url":null,"abstract":"<p>Image-based virtual try-on aims at transferring a target in-shop garment onto a reference person, which has garnered significant attention from the research communities recently. However, previous methods have faced severe challenges in handling occlusion problems. To address this limitation, we classify occlusion problems into three types based on the reference person’s arm postures: <i>single-arm occlusion</i>, <i>two-arm non-crossed occlusion</i>, and <i>two-arm crossed occlusion</i>. Specifically, we propose a novel Occlusion-Free Virtual Try-On Network (OF-VTON) that effectively overcomes these occlusion challenges. The OF-VTON framework consists of two core components: i) a new <i>Recurrent Appearance Flow based Deformation</i> (RAFD) model that robustly aligns the in-shop garment to the reference person by adopting a <i>multi-task learning strategy</i>. This model jointly produces the dense appearance flow to warp the garment and predicts a human segmentation map to provide semantic guidance for the subsequent image synthesis model. ii) a powerful <i>Multi-mask Image SynthesiS</i> (MISS) model that generates photo-realistic try-on results by introducing a new <i>mask generation and selection mechanism</i>. Experimental results demonstrate that our proposed OF-VTON significantly outperforms existing state-of-the-art methods by mitigating the impact of occlusion problems. Our code is available at https://github.com/gxl-groups/OF-VTON.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"19 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140636678","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Change captioning aims to describe the difference within an image pair in natural language, which combines visual comprehension and language generation. Although significant progress has been achieved, perceiving object change from different perspectives remains a key challenge, especially in the severe situation of drastic viewpoint change. In this paper, we propose a novel full-attentive network, namely the Multi-grained Representation Aggregating Transformer (MURAT), to distinguish the actual change from viewpoint change. Specifically, the Pair Encoder first captures similar semantics between pairwise objects in a multi-level manner, which are regarded as semantic cues for distinguishing the irrelevant change. Next, a novel Multi-grained Representation Aggregator (MRA) is designed to construct a reliable difference representation by employing both coarse- and fine-grained semantic cues. Finally, the language decoder generates a description of the change based on the output of the MRA. Besides, a Gating Cycle Mechanism is introduced to enforce semantic consistency between difference representation learning and language generation through a reverse manipulation, so as to bridge the semantic gap between change features and text features. Extensive experiments demonstrate that the proposed MURAT greatly improves the ability to describe the actual change under the distraction of irrelevant changes and achieves state-of-the-art performance on three benchmarks, CLEVR-Change, CLEVR-DC and Spot-the-Diff.
{"title":"Multi-grained Representation Aggregating Transformer with Gating Cycle for Change Captioning","authors":"Shengbin Yue, Yunbin Tu, Liang Li, Shengxiang Gao, Zhengtao Yu","doi":"10.1145/3660346","DOIUrl":"https://doi.org/10.1145/3660346","url":null,"abstract":"<p>Change captioning aims to describe the difference within an image pair in natural language, which combines visual comprehension and language generation. Although significant progress has been achieved, it remains a key challenge of perceiving the object change from different perspectives, especially the severe situation with drastic viewpoint change. In this paper, we propose a novel full-attentive network, namely Multi-grained Representation Aggregating Transformer (MURAT), to distinguish the actual change from viewpoint change. Specifically, the Pair Encoder first captures similar semantics between pairwise objects in a multi-level manner, which are regarded as the semantic cues of distinguishing the irrelevant change. Next, a novel Multi-grained Representation Aggregator (MRA) is designed to construct the reliable difference representation by employing both coarse- and fine-grained semantic cues. Finally, the language decoder generates a description of the change based on the output of MRA. Besides, the Gating Cycle Mechanism is introduced to facilitate the semantic consistency between difference representation learning and language generation with a reverse manipulation, so as to bridge the semantic gap between change features and text features. Extensive experiments demonstrate that the proposed MURAT can greatly improve the ability to describe the actual change in the distraction of irrelevant change and achieves state-of-the-art performance on three benchmarks, CLEVR-Change, CLEVR-DC and Spot-the-Diff.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"81 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140636796","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Walayat Hussain, Honghao Gao, Rafiul Karim, Abdulmotaleb El Saddik
ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) has been dedicated to advancing multimedia research, fostering discoveries, innovations, and practical applications since 2005. The journal consistently publishes top-notch, original research in emerging fields through open submissions, calls for papers, special issues, rigorous review processes, and diverse research topics. This study aims to delve into an extensive bibliometric analysis of the journal, utilising various bibliometric indicators. The paper seeks to unveil the latent implications within the journal’s scholarly landscape from 2005 to 2022. The data primarily draws from the Web of Science (WoS) Core Collection database. The analysis encompasses diverse viewpoints, including yearly publication rates and citations, identifying highly cited papers, and assessing the most prolific authors, institutions, and countries. The paper employs VOSviewer-generated graphical maps, effectively illustrating networks of co-citations, keyword co-occurrences, and institutional and national bibliographic couplings. Furthermore, the study conducts a comprehensive global and temporal examination of co-occurrences of the author’s keywords. This investigation reveals the emergence of numerous novel keywords over the past decades.
{"title":"Seventeen Years of the ACM Transactions on Multimedia Computing, Communications and Applications: A Bibliometric Overview","authors":"Walayat Hussain, Honghao Gao, Rafiul Karim, Abdulmotaleb El Saddik","doi":"10.1145/3660347","DOIUrl":"https://doi.org/10.1145/3660347","url":null,"abstract":"<p>ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) has been dedicated to advancing multimedia research, fostering discoveries, innovations, and practical applications since 2005. The journal consistently publishes top-notch, original research in emerging fields through open submissions, calls for papers, special issues, rigorous review processes, and diverse research topics. This study aims to delve into an extensive bibliometric analysis of the journal, utilising various bibliometric indicators. The paper seeks to unveil the latent implications within the journal’s scholarly landscape from 2005 to 2022. The data primarily draws from the Web of Science (WoS) Core Collection database. The analysis encompasses diverse viewpoints, including yearly publication rates and citations, identifying highly cited papers, and assessing the most prolific authors, institutions, and countries. The paper employs VOSviewer-generated graphical maps, effectively illustrating networks of co-citations, keyword co-occurrences, and institutional and national bibliographic couplings. Furthermore, the study conducts a comprehensive global and temporal examination of co-occurrences of the author’s keywords. This investigation reveals the emergence of numerous novel keywords over the past decades.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"25 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140617366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cheonjin Park, Chinmaey Shende, Subhabrata Sen, Bing Wang
Smartphones have emerged as ubiquitous platforms for people to consume content in a wide range of consumption contexts (C2), e.g., over cellular or WiFi, playing back audio and video directly on the phone or through peripheral devices such as external screens or speakers. In this paper, we argue that a user’s specific C2 is an important factor to consider in Adaptive Bitrate (ABR) streaming. We examine the current practices of using C2 in five popular ABR players and identify various limitations in existing treatments that have a detrimental impact on network resource usage and user experience. We then formulate C2-cognizant ABR streaming as an optimization problem and develop practical best-practice guidelines to realize it. Instantiating these guidelines, we develop a proof-of-concept implementation in the widely used state-of-the-art ExoPlayer platform and demonstrate that it leads to significantly better tradeoffs in terms of user experience and resource usage. Last, we show that the guidelines also benefit the dash.js player, which uses an ABR logic significantly different from that of ExoPlayer.
{"title":"C2: ABR Streaming in Cognizant of Consumption Context for Improved QoE and Resource Usage Tradeoffs","authors":"Cheonjin Park, Chinmaey Shende, Subhabrata Sen, Bing Wang","doi":"10.1145/3652517","DOIUrl":"https://doi.org/10.1145/3652517","url":null,"abstract":"<p>Smartphones have emerged as ubiquitous platforms for people to consume content in a wide range of <i>consumption contexts (C2)</i>, e.g., over cellular or WiFi, playing back audio and video directly on phone or through peripheral devices such as external screens or speakers. In this paper, we argue that a user’s specific C2 is an important factor to consider in Adaptive Bitrate (ABR) streaming. We examine the current practices of using C2 in five popular ABR players, and identify various limitations in existing treatments that have a detrimental impact on network resource usage and user experience. We then formulate C2-cognizant ABR streaming as an optimization problem and develop practical best-practice guidelines to realize it. Instantiating these guidelines, we develop a proof-of-concept implementation in the widely used state-of-the-art ExoPlayer platform and demonstrate that it leads to significantly better tradeoffs in terms of user experience and resource usage. Last, we show that the guidelines also benefit <monospace>dash.js</monospace> player that uses an ABR logic significantly different from that of ExoPlayer.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"12 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140616852","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}