The fifth ACM International Workshop on Multimedia Content Analysis in Sports (ACM MMSports'22) is part of the ACM International Conference on Multimedia 2022 (ACM Multimedia 2022). After two years of purely virtual MMSports workshops due to COVID-19, MMSports'22 is held on-site again. The goal of this workshop is to bring together researchers and practitioners from academia and industry to address challenges and report progress in mining, analyzing, understanding, and visualizing multimedia/multimodal data in sports, sports broadcasts, sports games, and sports medicine. The combination of sports and modern technology offers a novel and intriguing field of research, with promising approaches for visual broadcast augmentation and understanding, for statistical analysis and evaluation, and for sensor fusion during workouts as well as competitions. There is a lack of research communities focusing on the fusion of multiple modalities. We are helping to close this research gap with this workshop series on multimedia content analysis in sports. The related workshop proceedings are available in the ACM Digital Library at https://dl.acm.org/doi/proceedings/10.1145/3552437.
{"title":"MMSports'22: 5th International ACM Workshop on Multimedia Content Analysis in Sports","authors":"H. Saito, T. Moeslund, R. Lienhart","doi":"10.1145/3503161.3551791","DOIUrl":"https://doi.org/10.1145/3503161.3551791","url":null,"abstract":"The fifth ACM International Workshop on Multimedia Content Analysis in Sports (ACM MMSports'22) is part of the ACM International Conference on Multimedia 2022 (ACM Multimedia 2022). After two years of pure virtual MMSports workshops due to COVID-19, MMSports'22 is held on-site again. The goal of this workshop is to bring together researchers and practitioners from academia and industry to address challenges and report progress in mining, analyzing, understanding, and visualizing multimedia/multimodal data in sports, sports broadcasts, sports games and sports medicine. The combination of sports and modern technology offers a novel and intriguing field of research with promising approaches for visual broadcast augmentation and understanding, for statistical analysis and evaluation, and for sensor fusion during workouts as well as competitions. There is a lack of research communities focusing on the fusion of multiple modalities. We are helping to close this research gap with this workshop series on multimedia content analysis in sports. Related Workshop Proceedings are available in the ACM DL at: https://dl.acm.org/doi/proceedings/10.1145/3552437.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129933140","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Over the past decades, 360-degree videos have attracted wide interest for the immersive experience they bring to viewers. The rise of high-resolution 360-degree videos greatly challenges traditional video streaming systems in limited network environments. Given the limited bandwidth, tile-based video streaming with adaptive bitrate selection has been widely studied to improve viewers' Quality of Experience (QoE) by tiling the video frames and allocating different bitrates to tiles inside and outside viewers' viewports. Existing solutions for viewport prediction and bitrate selection train general models without catering to the intrinsic need for personalization. In this paper, we present the first meta-learning-based personalized 360-degree video streaming framework. The commonality among viewers with different viewing patterns and QoE preferences is captured by efficient meta-network designs. Specifically, we design a meta-based long short-term memory model for viewport prediction and a meta-based reinforcement learning model for bitrate selection. Extensive experiments on real-world datasets demonstrate that our framework not only outperforms state-of-the-art data-driven approaches, improving prediction accuracy by 11% and QoE by 27% on average, but also quickly adapts to users with new preferences, requiring 67%-88% fewer training epochs on average.
{"title":"Personalized 360-Degree Video Streaming: A Meta-Learning Approach","authors":"Yi-Hsien Lu, Yifei Zhu, Zhi Wang","doi":"10.1145/3503161.3548047","DOIUrl":"https://doi.org/10.1145/3503161.3548047","url":null,"abstract":"Over the past decades, 360-degree videos have attracted wide interest for the immersive experience they bring to viewers. The rising of high-resolution 360-degree videos greatly challenges the traditional video streaming systems in limited network environments. Given the limited bandwidth, tile-based video streaming with adaptive bitrate selection has been widely studied to improve the Quality of Experience (QoE) of viewers by tiling the video frames and allocating different bitrates for tiles inside and outside viewers' viewports. Existing solutions for viewport prediction and bitrate selection train general models without catering to the intrinsic need for personalization. In this paper, we present the first meta-learning-based personalized 360-degree video streaming framework. The commonality among viewers of different viewing patterns and QoE preferences is captured by efficient meta-network designs. Specifically, we design a meta-based long-short term memory model for viewport prediction and a meta-based reinforcement learning model for bitrate selection. Extensive experiments on real-world datasets demonstrate that our framework not only outperforms the state-of-the-art data-driven approaches in prediction accuracy by 11% on average and improves QoE by 27% on average, but also quickly adapts to users with new preferences with on average 67%-88% less training epochs.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130398251","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Neural Architecture Search (NAS) is an effective way to automatically design neural architectures for various multimedia applications. Weight-sharing, one of the most popular NAS strategies, has been widely adopted due to its search efficiency. However, existing weight-sharing NAS methods overlook the influence of data distribution and treat each data sample equally. In contrast, in this paper we empirically discover that different data samples have different influences on architectures, e.g., some data samples are easy for certain architectures to fit but hard for others. Hence, architectures that perform better on early data samples are more likely to be discovered during the NAS search process, which leads to suboptimal search results. To tackle this problem, we propose Curriculum-NAS, a curriculum training framework for weight-sharing NAS that dynamically changes the training data weights during the search process. In particular, Curriculum-NAS utilizes the multiple subnets included in weight-sharing NAS to jointly assess data uncertainty, which serves as the difficulty criterion in a curriculum manner, so that potentially optimal architectures obtain a higher probability of being fully trained and discovered. Extensive experiments on several image and text datasets demonstrate that Curriculum-NAS brings consistent improvements over existing weight-sharing NAS methods. The code is available online at https://github.com/zhouyw16/curriculum-nas.
{"title":"Curriculum-NAS: Curriculum Weight-Sharing Neural Architecture Search","authors":"Yuwei Zhou, Xin Wang, Hong Chen, Xuguang Duan, Chaoyu Guan, Wenwu Zhu","doi":"10.1145/3503161.3548271","DOIUrl":"https://doi.org/10.1145/3503161.3548271","url":null,"abstract":"Neural Architecture Search (NAS) is an effective way to automatically design neural architectures for various multimedia applications. Weight-sharing, as one of the most popular NAS strategies, has been widely adopted due to its search efficiency. Existing weight-sharing NAS methods overlook the influence of data distribution and treat each data sample equally. Contrastively, in this paper, we empirically discover that different data samples have different influences on architectures, e.g., some data samples are easy to fit by certain architectures but hard by others. Hence, there exist architectures with better performances on early data samples being more likely to be discovered in the whole NAS searching process, which leads to a suboptimal searching result. To tackle this problem, we propose Curriculum-NAS, a curriculum training framework on weight-sharing NAS, which dynamically changes the training data weights during the searching process. In particular, Curriculum-NAS utilizes the multiple subnets included in weight-sharing NAS to jointly assess data uncertainty, which serves as the difficulty criterion in a curriculum manner, so that the potentially optimal architectures can obtain higher probability of being fully trained and discovered. Extensive experiments on several image and text datasets demonstrate that our Curriculum-NAS can bring consistent improvement over existing weight-sharing NAS. The code is available online at https://github.com/zhouyw16/curriculum-nas.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126759426","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Low-light enhancement aims to recover a high-contrast, normal-light image from a low-light image with poor exposure and low contrast. Inspired by curve adjustment in photo editing software and by Chebyshev approximation, this paper presents a novel model for brightening low-light images. The proposed model, ChebyLighter, learns to estimate pixel-wise adjustment curves for a low-light image and recurrently reconstructs an enhanced output. In ChebyLighter, a Chebyshev image series is first generated. Pixel-wise coefficient matrices are then estimated with Triple Coefficient Estimation (TCE) modules, and the final enhanced image is recurrently reconstructed by Chebyshev Attention Weighted Summation (CAWS). The TCE module is specifically designed around a dual attention mechanism with three necessary inputs. Our method achieves strong performance because the adjustment curves can be obtained through numerical approximation by our model. With extensive quantitative and qualitative experiments on diverse test images, we demonstrate that the proposed method performs favorably against state-of-the-art low-light image enhancement algorithms.
{"title":"ChebyLighter: Optimal Curve Estimation for Low-light Image Enhancement","authors":"Jinwang Pan, Deming Zhai, Yuanchao Bai, Junjun Jiang, Debin Zhao, Xianming Liu","doi":"10.1145/3503161.3548135","DOIUrl":"https://doi.org/10.1145/3503161.3548135","url":null,"abstract":"Low-light enhancement aims to recover a high contrast normal light image from a low-light image with bad exposure and low contrast. Inspired by curve adjustment in photo editing software and Chebyshev approximation, this paper presents a novel model for brightening low-light images. The proposed model, ChebyLighter, learns to estimate pixel-wise adjustment curves for a low-light image recurrently to reconstruct an enhanced output. In ChebyLighter, Chebyshev image series are first generated. Then pixel-wise coefficient matrices are estimated with Triple Coefficient Estimation (TCE) modules and the final enhanced image is recurrently reconstructed by Chebyshev Attention Weighted Summation (CAWS). The TCE module is specifically designed based on dual attention mechanism with three necessary inputs. Our method can achieve ideal performance because adjustment curves can be obtained with numerical approximation by our model. With extensive quantitative and qualitative experiments on diverse test images, we demonstrate that the proposed method performs favorably against state-of-the-art low-light image enhancement algorithms.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126953613","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Typical text spotters follow a two-stage spotting strategy: first detect the precise boundary of a text instance, then perform text recognition within the located text region. While this strategy has achieved substantial progress, it has two underlying limitations. 1) The performance of text recognition depends heavily on the precision of text detection, resulting in potential error propagation from detection to recognition. 2) The RoI cropping that bridges detection and recognition introduces background noise and causes information loss when pooling or interpolating from feature maps. In this work, we propose the single-shot Self-Reliant Scene Text Spotter (SRSTS), which circumvents these limitations by decoupling recognition from detection. Specifically, we conduct text detection and recognition in parallel and bridge them via shared positive anchor points. Consequently, our method is able to recognize text instances correctly even when the precise text boundaries are challenging to detect. Moreover, our method substantially reduces the annotation cost for text detection. Extensive experiments on both regular-shaped and arbitrary-shaped text benchmarks demonstrate that SRSTS compares favorably to previous state-of-the-art spotters in terms of both accuracy and efficiency.
{"title":"Decoupling Recognition from Detection: Single Shot Self-Reliant Scene Text Spotter","authors":"Jingjing Wu, Pengyuan Lyu, Guangming Lu, Chengquan Zhang, Kun Yao, Wenjie Pei","doi":"10.1145/3503161.3548266","DOIUrl":"https://doi.org/10.1145/3503161.3548266","url":null,"abstract":"Typical text spotters follow the two-stage spotting strategy: detect the precise boundary for a text instance first and then perform text recognition within the located text region. While such strategy has achieved substantial progress, there are two underlying limitations. 1) The performance of text recognition depends heavily on the precision of text detection, resulting in the potential error propagation from detection to recognition. 2) The RoI cropping which bridges the detection and recognition brings noise from background and leads to information loss when pooling or interpolating from feature maps. In this work we propose the single shot Self-Reliant Scene Text Spotter (SRSTS), which circumvents these limitations by decoupling recognition from detection. Specifically, we conduct text detection and recognition in parallel and bridge them by the shared positive anchor point. Consequently, our method is able to recognize the text instances correctly even though the precise text boundaries are challenging to detect. Additionally, our method reduces the annotation cost for text detection substantially. Extensive experiments on regular-shaped benchmark and arbitrary-shaped benchmark demonstrate that our SRSTS compares favorably to previous state-of-the-art spotters in terms of both accuracy and efficiency.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123807080","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multimodal dialogue systems have attracted much attention recently, but they still fall short of skills such as: 1) automatically generating context-specific responses instead of safe but generic ones; 2) naturally coordinating between the different information modalities (e.g., text and image) in a response; 3) intuitively explaining the reasons for a generated response and improving a specific response without re-training the whole model. To approach these goals, we propose a different angle on the task: Reflecting Experiences for Response Generation (RERG). This is motivated by the observation that generating a response from scratch can be hard, but becomes much easier if we can access other similar dialogue contexts and their corresponding responses. In particular, RERG first uses a retrieval model enhanced by multimodal contrastive learning to solicit similar dialogue instances. It then employs a cross-copy-based reuse model to simultaneously explore the current dialogue context (vertical) and similar dialogue instances' responses (horizontal) for response generation. Experimental results demonstrate that our model outperforms state-of-the-art models on both automatic metrics and human evaluation. Moreover, RERG naturally provides supporting dialogue instances for better explainability. It also adapts well to unseen dialogue settings by simply adding related samples to the retrieval datastore, without re-training the whole model.
{"title":"Reflecting on Experiences for Response Generation","authors":"Chenchen Ye, Lizi Liao, Suyu Liu, Tat-seng Chua","doi":"10.1145/3503161.3548305","DOIUrl":"https://doi.org/10.1145/3503161.3548305","url":null,"abstract":"Multimodal dialogue systems attract much attention recently, but they are far from skills like: 1) automatically generate context- specific responses instead of safe but general responses; 2) naturally coordinate between the different information modalities (e.g. text and image) in responses; 3) intuitively explain the reasons for generated responses and improve a specific response without re-training the whole model. To approach these goals, we propose a different angle for the task - Reflecting Experiences for Response Generation (RERG). This is supported by the fact that generating a response from scratch can be hard, but much easier if we can access other similar dialogue contexts and the corresponding responses. In particular, RERG first uses a multimodal contrastive learning enhanced retrieval model for soliciting similar dialogue instances. It then employs a cross copy based reuse model to explore the current dialogue context (vertical) and similar dialogue instances' responses (horizontal) for response generation simultaneously. Experimental results demonstrate that our model outperforms other state-of-the-art models on both automatic metrics and human evaluation. Moreover, RERG naturally provides supporting dialogue instances for better explainability. It also has a strong capability in adapting to unseen dialogue settings by simply adding related samples to the retrieval datastore without re-training the whole model.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"123 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121624180","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The hardware-accelerated real-time compression of 8K Ultra-High-Definition (UHD) video is an exemplary application empowered by the latest video coding standards. However, the coding tools added to the recently released third-generation audio video coding standard (AVS3) greatly increase coding complexity, which seriously hinders efficient hardware encoder implementation. To break this bottleneck, this paper presents OpenHardwareVC, the first open-source software library for 8K UHD video coding hardware implementation. Specifically, based on an analysis of the original AVS3 software algorithms, the library provides hardware acceleration designs for the four major coding stages: coding unit (CU) partition, intra prediction, transform, and entropy coding. Simulation results on a Xilinx VU440 FPGA show that real-time compression of 8K UHD video at 30 frames per second (fps) can readily be supported by the software-described modules packaged in this library. The release of this library facilitates the hardware design and system implementation of UHD video coding and also benefits the promotion of the new coding standard. The open-source library is available at https://git.openi.org.cn/OpenHardwareVC.
{"title":"OpenHardwareVC","authors":"Wei Gao, Hang Yuan, Yang Guo, Lvfang Tao, Zhanyuan Cai, Ge Li","doi":"10.1145/3503161.3548543","DOIUrl":"https://doi.org/10.1145/3503161.3548543","url":null,"abstract":"The hardware-accelerated real-time compression of 8K Ultra-High-Definition (UHD) video is an exemplary application that empowered by the latest video coding standard. However, the coding tools added to the recently released third-generation audio video coding standard (AVS3) greatly increase the coding complexity, which seriously hinders the efficient implementation of hardware encoder. In order to break the known bottleneck, this paper presents the first open source software library for 8K UHD video coding hardware implementation, namely OpenHardwareVC. Specifically, based on the analysis of the original AVS3 software algorithm, we provide the hardware acceleration designs of the four major coding stages, including coding unit (CU) partition, intra prediction, transform and entropy coding, in this library. Simulation results on Xilinx VU440 FPGA show that the real-time compression of 8K UHD videos at 30 frames per second (fps) can be easily supported based on software-described modules packaged in this library. The release of this library is quite favorable for the hardware design and system implementation of UHD video coding, which is also beneficial to the promotion of the new coding standard. The open source library for OpenHardwareVC is available at https://git.openi.org.cn/OpenHardwareVC.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"87 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124337199","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this work, we propose additions to the COSMOS and COSMOS on Steroids pipelines for cheapfake detection in Task 1 of the ACM Grand Challenge for Detecting Cheapfakes. We compute sentiment features, namely polarity and subjectivity, from the news image captions. Multiple logistic regression results show that these sentiment features are significant predictors of the outcome. We then combine the sentiment features with the four image-text features obtained in the aforementioned previous works to train an MLP, which classifies each input set as out-of-context (OOC) or not-out-of-context (NOOC). On a test set of 400 samples, the MLP with all features achieved a score of 87.25%, and the MLP with only the image-text features a score of 88%. Beyond the challenge requirements, we also propose a separate pipeline to automatically construct caption pairs and annotations from the images and captions provided in the large, un-annotated training dataset. We hope that this endeavor will open the door for improvements, since hand-annotating cheapfake labels is time-consuming. To evaluate the performance on the test set, the Docker image with the models is available at https://hub.docker.com/repository/docker/malkaddour/mmsys22cheapfakes, and the open-source code for the project is accessible at https://github.com/malkaddour/ACMM-22-Cheapfake-Detection-Sentiment-aware-Classifier-for-Out-of-Context-Caption-Detection.
{"title":"Sentiment-aware Classifier for Out-of-Context Caption Detection","authors":"Muhannad Alkaddour, Abhinav Dhall, U. Tariq, Hasan Al Nashash, Fares Al-Shargie","doi":"10.1145/3503161.3551603","DOIUrl":"https://doi.org/10.1145/3503161.3551603","url":null,"abstract":"In this work we propose additions to the COSMOS and COSMOS on Steroids pipelines for the detection of Cheapfakes for Task 1 of the ACM Grand Challenge for Detecting Cheapfakes. We compute sentiment features, namely polarity and subjectivity, using the news image captions. Multiple logistic regression results show that these sentiment features are significant in prediction of the outcome. We then combine the sentiment features with the four image-text features obtained in the aforementioned previous works to train an MLP. This classifies sets of inputs into being out-of-context (OOC) or not-out-of-context (NOOC). On a test set of 400 samples, the MLP with all features achieved a score of 87.25%, and that with only the image-text features a score of 88%. In addition to the challenge requirements, we also propose a separate pipeline to automatically construct caption pairs and annotations using the images and captions provided in the large, un-annotated training dataset. We hope that this endeavor will open the door for improvements, since hand-annotating cheapfake labels is time-consuming. To evaluate the performance on the test set, the Docker image with the models is available at: https://hub.docker.com/repository/docker/malkaddour/mmsys22cheapfakes. The open-source code for the project is accessible at: https://github.com/malkaddour/ACMM-22-Cheapfake-Detection-Sentiment-aware-Classifier-for-Out-of-Context-Caption-Detection.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124349387","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Over the last few years, speech synthesis and voice conversion technology has made significant progress with the development of deep learning, and models can now generate realistic, human-like speech. It is difficult for most people to distinguish such generated audio from real recordings. However, this technology also poses a great threat to the global political economy and social stability if attackers and criminals misuse it with the intent to cause harm. In this workshop, we aim to bring together researchers from the fields of audio deepfake detection, audio deep synthesis, audio fake game, and adversarial attacks to further discuss recent research and future directions for detecting deepfake and manipulated audio in multimedia.
{"title":"DDAM '22: 1st International Workshop on Deepfake Detection for Audio Multimedia","authors":"J. Tao, Jiangyan Yi, Cunhang Fan, Ruibo Fu, Shan Liang, Pengyuan Zhang, Haizhou Li, H. Meng, Dong Yu, M. Akagi","doi":"10.1145/3503161.3554779","DOIUrl":"https://doi.org/10.1145/3503161.3554779","url":null,"abstract":"Over the last few years, the technology of speech synthesis and voice conversion has made significant improvement with the development of deep learning. The models can generate realistic and human-like speech. It is difficult for most people to distinguish the generated audio from the real. However, this technology also poses a great threat to the global political economy and social stability if some attackers and criminals misuse it with the intent to cause harm. In this workshop, we aim to bring together researchers from the fields of audio deepfake detection, audio deep synthesis, audio fake game and adversarial attacks to further discuss recent research and future directions for detecting deepfake and manipulated audios in multimedia.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"180 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124492536","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Vision-and-Language Navigation requires an agent to navigate to a target location by progressively grounding and following the relevant instruction, conditioned on its memory and current observation. Existing works utilize cross-modal transformers to pass messages between the visual and textual modalities. However, they remain limited in mining the fine-grained matching between the underlying components of trajectories and instructions. Inspired by the significant progress achieved by large-scale pre-training methods, in this paper we propose CSAP, a new method of Cross-modal Semantic Alignment Pre-training for Vision-and-Language Navigation. It is designed to learn the alignment from trajectory-instruction pairs through two novel tasks: trajectory-conditioned masked fragment modeling and contrastive semantic-alignment modeling. Specifically, trajectory-conditioned masked fragment modeling encourages the agent to extract useful visual information to reconstruct the masked fragment, while contrastive semantic-alignment modeling aligns visual representations with their corresponding phrase embeddings. Experimental results on the benchmark dataset demonstrate that a transformer-based navigation agent pre-trained with our proposed CSAP outperforms existing methods on both SR and SPL scores.
{"title":"Cross-modal Semantic Alignment Pre-training for Vision-and-Language Navigation","authors":"Siying Wu, Xueyang Fu, Feng Wu, Zhengjun Zha","doi":"10.1145/3503161.3548283","DOIUrl":"https://doi.org/10.1145/3503161.3548283","url":null,"abstract":"Vision-and-Language Navigation needs an agent to navigate to a target location by progressively grounding and following the relevant instruction conditioning on its memory and current observation. Existing works utilize the cross-modal transformer to pass the message between visual modality and textual modality. However, they are still limited to mining the fine-grained matching between the underlying components of trajectories and instructions. Inspired by the significant progress achieved by large-scale pre-training methods, in this paper, we propose CSAP, a new method of Cross-modal Semantic Alignment Pre-training for Vision-and-Language Navigation. It is designed to learn the alignment from trajectory-instruction pairs through two novel tasks, including trajectory-conditioned masked fragment modeling and contrastive semantic-alignment modeling. Specifically, the trajectory-conditioned masked fragment modeling encourages the agent to extract useful visual information to reconstruct the masked fragment. The contrastive semantic-alignment modeling is designed to align the visual representation with corresponding phrase embeddings. By showing experimental results on the benchmark dataset, we demonstrate that transformer architecture-based navigation agent pre-trained with our proposed CSAP outperforms existing methods on both SR and SPL scores.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124529297","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}