
Proceedings of the 30th ACM International Conference on Multimedia: Latest Publications

MMSports'22: 5th International ACM Workshop on Multimedia Content Analysis in Sports
Pub Date : 2022-10-10 DOI: 10.1145/3503161.3551791
H. Saito, T. Moeslund, R. Lienhart
The fifth ACM International Workshop on Multimedia Content Analysis in Sports (ACM MMSports'22) is part of the ACM International Conference on Multimedia 2022 (ACM Multimedia 2022). After two years of pure virtual MMSports workshops due to COVID-19, MMSports'22 is held on-site again. The goal of this workshop is to bring together researchers and practitioners from academia and industry to address challenges and report progress in mining, analyzing, understanding, and visualizing multimedia/multimodal data in sports, sports broadcasts, sports games and sports medicine. The combination of sports and modern technology offers a novel and intriguing field of research with promising approaches for visual broadcast augmentation and understanding, for statistical analysis and evaluation, and for sensor fusion during workouts as well as competitions. There is a lack of research communities focusing on the fusion of multiple modalities. We are helping to close this research gap with this workshop series on multimedia content analysis in sports. Related Workshop Proceedings are available in the ACM DL at: https://dl.acm.org/doi/proceedings/10.1145/3552437.
Citations: 3
Personalized 360-Degree Video Streaming: A Meta-Learning Approach
Pub Date : 2022-10-10 DOI: 10.1145/3503161.3548047
Yi-Hsien Lu, Yifei Zhu, Zhi Wang
Over the past decades, 360-degree videos have attracted wide interest for the immersive experience they bring to viewers. The rise of high-resolution 360-degree videos greatly challenges traditional video streaming systems in limited network environments. Given the limited bandwidth, tile-based video streaming with adaptive bitrate selection has been widely studied to improve the Quality of Experience (QoE) of viewers by tiling the video frames and allocating different bitrates to tiles inside and outside viewers' viewports. Existing solutions for viewport prediction and bitrate selection train general models without catering to the intrinsic need for personalization. In this paper, we present the first meta-learning-based personalized 360-degree video streaming framework. The commonality among viewers with different viewing patterns and QoE preferences is captured by efficient meta-network designs. Specifically, we design a meta-based long short-term memory model for viewport prediction and a meta-based reinforcement learning model for bitrate selection. Extensive experiments on real-world datasets demonstrate that our framework not only outperforms the state-of-the-art data-driven approaches in prediction accuracy by 11% on average and improves QoE by 27% on average, but also quickly adapts to users with new preferences with 67%-88% fewer training epochs on average.
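To make the personalization step more concrete, the sketch below clones a meta-trained viewport predictor and fine-tunes it on a few traces from a new viewer, in the spirit of the adaptation described above. It is a minimal PyTorch sketch under assumed names and shapes (ViewportLSTM and adapt_to_viewer are illustrative, not the authors' code); the meta-training loop and the reinforcement-learning bitrate module are omitted.

```python
import copy
import torch
import torch.nn as nn

class ViewportLSTM(nn.Module):
    """Predicts the next viewport centre (yaw, pitch) from a short
    history of head-orientation samples."""
    def __init__(self, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=2, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)

    def forward(self, x):                      # x: (batch, time, 2)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])           # (batch, 2)

def adapt_to_viewer(meta_model, support_x, support_y, steps=5, lr=1e-2):
    """Few-shot personalisation: clone the meta-trained predictor and
    fine-tune it on a handful of traces from the new viewer."""
    model = copy.deepcopy(meta_model)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(steps):
        optimizer.zero_grad()
        loss = loss_fn(model(support_x), support_y)
        loss.backward()
        optimizer.step()
    return model
```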
Citations: 4
Curriculum-NAS: Curriculum Weight-Sharing Neural Architecture Search
Pub Date : 2022-10-10 DOI: 10.1145/3503161.3548271
Yuwei Zhou, Xin Wang, Hong Chen, Xuguang Duan, Chaoyu Guan, Wenwu Zhu
Neural Architecture Search (NAS) is an effective way to automatically design neural architectures for various multimedia applications. Weight-sharing, as one of the most popular NAS strategies, has been widely adopted due to its search efficiency. Existing weight-sharing NAS methods overlook the influence of data distribution and treat each data sample equally. In contrast, in this paper we empirically discover that different data samples have different influences on architectures, e.g., some data samples are easy to fit by certain architectures but hard by others. Hence, architectures that perform better on early data samples are more likely to be discovered in the whole NAS searching process, which leads to suboptimal search results. To tackle this problem, we propose Curriculum-NAS, a curriculum training framework for weight-sharing NAS, which dynamically changes the training data weights during the searching process. In particular, Curriculum-NAS utilizes the multiple subnets included in weight-sharing NAS to jointly assess data uncertainty, which serves as the difficulty criterion in a curriculum manner, so that potentially optimal architectures obtain a higher probability of being fully trained and discovered. Extensive experiments on several image and text datasets demonstrate that our Curriculum-NAS can bring consistent improvement over existing weight-sharing NAS. The code is available online at https://github.com/zhouyw16/curriculum-nas.
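The sketch below illustrates the curriculum idea of weighting training samples by an uncertainty signal aggregated across several subnets; the variance-based uncertainty measure and the linear schedule are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def sample_uncertainty(subnet_logits):
    """Per-sample uncertainty as the variance of predicted probabilities
    across sampled subnets; subnet_logits: (num_subnets, batch, classes)."""
    probs = F.softmax(subnet_logits, dim=-1)
    return probs.var(dim=0).mean(dim=-1)        # (batch,)

def curriculum_weights(uncertainty, epoch, total_epochs):
    """Down-weight hard (uncertain) samples early in the search and give
    them full weight as training progresses (hypothetical linear schedule)."""
    progress = epoch / max(total_epochs - 1, 1)
    hardness = uncertainty / (uncertainty.max() + 1e-8)
    return 1.0 - (1.0 - progress) * hardness

def weighted_loss(logits, targets, weights):
    """Cross-entropy over the supernet batch, re-weighted per sample."""
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    return (per_sample * weights).mean()
```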
Citations: 5
ChebyLighter: Optimal Curve Estimation for Low-light Image Enhancement
Pub Date : 2022-10-10 DOI: 10.1145/3503161.3548135
Jinwang Pan, Deming Zhai, Yuanchao Bai, Junjun Jiang, Debin Zhao, Xianming Liu
Low-light enhancement aims to recover a high-contrast, normal-light image from a low-light image with poor exposure and low contrast. Inspired by curve adjustment in photo editing software and by Chebyshev approximation, this paper presents a novel model for brightening low-light images. The proposed model, ChebyLighter, learns to recurrently estimate pixel-wise adjustment curves for a low-light image to reconstruct an enhanced output. In ChebyLighter, a Chebyshev image series is first generated. Pixel-wise coefficient matrices are then estimated with Triple Coefficient Estimation (TCE) modules, and the final enhanced image is recurrently reconstructed by Chebyshev Attention Weighted Summation (CAWS). The TCE module is specifically designed based on a dual attention mechanism with three necessary inputs. Our method can achieve ideal performance because the adjustment curves can be obtained by numerical approximation with our model. With extensive quantitative and qualitative experiments on diverse test images, we demonstrate that the proposed method performs favorably against state-of-the-art low-light image enhancement algorithms.
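The two pieces that are easiest to write down, the Chebyshev image series and the weighted summation, are sketched below; the tensor shapes are assumptions and the TCE coefficient estimator itself is not reproduced.

```python
import torch

def chebyshev_series(x, order):
    """Chebyshev image series T_0(x)..T_order(x) (order >= 1) via the
    recurrence T_k(x) = 2*x*T_{k-1}(x) - T_{k-2}(x); x is the low-light
    image rescaled to [-1, 1] with shape (B, C, H, W)."""
    series = [torch.ones_like(x), x]
    for _ in range(2, order + 1):
        series.append(2 * x * series[-1] - series[-2])
    return torch.stack(series, dim=1)           # (B, order+1, C, H, W)

def weighted_summation(series, coeffs):
    """Enhanced image as a pixel-wise weighted sum of the series; coeffs
    has the same shape as series and would come from the TCE modules."""
    return (series * coeffs).sum(dim=1)         # (B, C, H, W)
```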
Citations: 3
Decoupling Recognition from Detection: Single Shot Self-Reliant Scene Text Spotter
Pub Date : 2022-10-10 DOI: 10.1145/3503161.3548266
Jingjing Wu, Pengyuan Lyu, Guangming Lu, Chengquan Zhang, Kun Yao, Wenjie Pei
Typical text spotters follow a two-stage spotting strategy: first detect the precise boundary of a text instance, then perform text recognition within the located text region. While this strategy has achieved substantial progress, it has two underlying limitations. 1) The performance of text recognition depends heavily on the precision of text detection, resulting in potential error propagation from detection to recognition. 2) The RoI cropping that bridges detection and recognition brings in background noise and leads to information loss when pooling or interpolating from feature maps. In this work we propose the single-shot Self-Reliant Scene Text Spotter (SRSTS), which circumvents these limitations by decoupling recognition from detection. Specifically, we conduct text detection and recognition in parallel and bridge them through shared positive anchor points. Consequently, our method is able to recognize text instances correctly even when the precise text boundaries are challenging to detect. Additionally, our method substantially reduces the annotation cost for text detection. Extensive experiments on regular-shaped and arbitrary-shaped benchmarks demonstrate that our SRSTS compares favorably to previous state-of-the-art spotters in terms of both accuracy and efficiency.
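As a toy illustration of the decoupling, a detection head and a recognition head can be run in parallel over the same feature map, coupled only through the positive anchor points at which the recognition output is read; the head designs, vocabulary size, and shapes here are invented for illustration and are not the SRSTS architecture.

```python
import torch
import torch.nn as nn

class DecoupledSpotterHeads(nn.Module):
    """Parallel detection and recognition heads over a shared feature map,
    bridged only by anchor point locations (illustrative shapes)."""
    def __init__(self, channels=256, vocab=97, max_len=25):
        super().__init__()
        self.det_head = nn.Conv2d(channels, 1, kernel_size=1)                # text-centre score map
        self.rec_head = nn.Conv2d(channels, vocab * max_len, kernel_size=1)  # per-pixel sequence logits
        self.vocab, self.max_len = vocab, max_len

    def forward(self, feats, anchors):
        # feats: (1, C, H, W); anchors: list of (y, x) positive anchor points
        score_map = self.det_head(feats)
        rec_map = self.rec_head(feats)
        texts = []
        for y, x in anchors:
            logits = rec_map[0, :, y, x].view(self.max_len, self.vocab)
            texts.append(logits.argmax(dim=-1))   # character indices read out at this anchor
        return score_map, texts
```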
Citations: 8
Reflecting on Experiences for Response Generation
Pub Date : 2022-10-10 DOI: 10.1145/3503161.3548305
Chenchen Ye, Lizi Liao, Suyu Liu, Tat-seng Chua
Multimodal dialogue systems have attracted much attention recently, but they still fall short of skills such as: 1) automatically generating context-specific responses instead of safe but generic responses; 2) naturally coordinating the different information modalities (e.g., text and image) in responses; 3) intuitively explaining the reasons for generated responses and improving a specific response without re-training the whole model. To approach these goals, we propose a different angle for the task - Reflecting Experiences for Response Generation (RERG). This is motivated by the fact that generating a response from scratch can be hard, but becomes much easier if we can access other similar dialogue contexts and their corresponding responses. In particular, RERG first uses a multimodal contrastive-learning-enhanced retrieval model to solicit similar dialogue instances. It then employs a cross-copy-based reuse model to explore the current dialogue context (vertical) and similar dialogue instances' responses (horizontal) for response generation simultaneously. Experimental results demonstrate that our model outperforms other state-of-the-art models on both automatic metrics and human evaluation. Moreover, RERG naturally provides supporting dialogue instances for better explainability. It also has a strong capability to adapt to unseen dialogue settings by simply adding related samples to the retrieval datastore, without re-training the whole model.
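The retrieval step, looking up similar dialogue contexts and reusing their responses, can be sketched as a cosine-similarity search over an embedding datastore; the function below is a generic stand-in (plain top-k cosine search is an assumption), and the paper's multimodal contrastive retriever and cross-copy reuse model are not reproduced.

```python
import torch
import torch.nn.functional as F

def retrieve_similar_dialogues(query_emb, memory_embs, responses, k=3):
    """Return the k stored responses whose dialogue-context embeddings are
    closest to the current (multimodal) context embedding.
    query_emb: (dim,); memory_embs: (num_memory, dim); responses: list of str."""
    q = F.normalize(query_emb, dim=-1)
    m = F.normalize(memory_embs, dim=-1)
    scores = m @ q                               # cosine similarities, (num_memory,)
    top = torch.topk(scores, k=min(k, len(responses))).indices
    return [responses[i] for i in top.tolist()]
```

Because the lookup itself has no trainable parameters, new dialogue instances can be appended to the datastore without re-training, which matches the adaptability property described above.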
Citations: 3
OpenHardwareVC
Pub Date : 2022-10-10 DOI: 10.1145/3503161.3548543
Wei Gao, Hang Yuan, Yang Guo, Lvfang Tao, Zhanyuan Cai, Ge Li
The hardware-accelerated real-time compression of 8K Ultra-High-Definition (UHD) video is an exemplary application empowered by the latest video coding standard. However, the coding tools added to the recently released third-generation audio video coding standard (AVS3) greatly increase the coding complexity, which seriously hinders the efficient implementation of hardware encoders. To break this bottleneck, this paper presents the first open-source software library for 8K UHD video coding hardware implementation, namely OpenHardwareVC. Specifically, based on an analysis of the original AVS3 software algorithm, the library provides hardware acceleration designs for the four major coding stages: coding unit (CU) partition, intra prediction, transform, and entropy coding. Simulation results on a Xilinx VU440 FPGA show that real-time compression of 8K UHD videos at 30 frames per second (fps) can easily be supported based on the software-described modules packaged in this library. The release of this library benefits the hardware design and system implementation of UHD video coding and also helps promote the new coding standard. The open source library for OpenHardwareVC is available at https://git.openi.org.cn/OpenHardwareVC.
{"title":"OpenHardwareVC","authors":"Wei Gao, Hang Yuan, Yang Guo, Lvfang Tao, Zhanyuan Cai, Ge Li","doi":"10.1145/3503161.3548543","DOIUrl":"https://doi.org/10.1145/3503161.3548543","url":null,"abstract":"The hardware-accelerated real-time compression of 8K Ultra-High-Definition (UHD) video is an exemplary application that empowered by the latest video coding standard. However, the coding tools added to the recently released third-generation audio video coding standard (AVS3) greatly increase the coding complexity, which seriously hinders the efficient implementation of hardware encoder. In order to break the known bottleneck, this paper presents the first open source software library for 8K UHD video coding hardware implementation, namely OpenHardwareVC. Specifically, based on the analysis of the original AVS3 software algorithm, we provide the hardware acceleration designs of the four major coding stages, including coding unit (CU) partition, intra prediction, transform and entropy coding, in this library. Simulation results on Xilinx VU440 FPGA show that the real-time compression of 8K UHD videos at 30 frames per second (fps) can be easily supported based on software-described modules packaged in this library. The release of this library is quite favorable for the hardware design and system implementation of UHD video coding, which is also beneficial to the promotion of the new coding standard. The open source library for OpenHardwareVC is available at https://git.openi.org.cn/OpenHardwareVC.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"87 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124337199","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Sentiment-aware Classifier for Out-of-Context Caption Detection
Pub Date : 2022-10-10 DOI: 10.1145/3503161.3551603
Muhannad Alkaddour, Abhinav Dhall, U. Tariq, Hasan Al Nashash, Fares Al-Shargie
In this work we propose additions to the COSMOS and COSMOS on Steroids pipelines for the detection of Cheapfakes in Task 1 of the ACM Grand Challenge for Detecting Cheapfakes. We compute sentiment features, namely polarity and subjectivity, from the news image captions. Multiple logistic regression results show that these sentiment features are significant in predicting the outcome. We then combine the sentiment features with the four image-text features obtained in the aforementioned previous works to train an MLP, which classifies input sets as out-of-context (OOC) or not-out-of-context (NOOC). On a test set of 400 samples, the MLP with all features achieved a score of 87.25%, and the one with only the image-text features a score of 88%. In addition to the challenge requirements, we also propose a separate pipeline to automatically construct caption pairs and annotations using the images and captions provided in the large, un-annotated training dataset. We hope that this endeavor will open the door for improvements, since hand-annotating cheapfake labels is time-consuming. To evaluate the performance on the test set, the Docker image with the models is available at: https://hub.docker.com/repository/docker/malkaddour/mmsys22cheapfakes. The open-source code for the project is accessible at: https://github.com/malkaddour/ACMM-22-Cheapfake-Detection-Sentiment-aware-Classifier-for-Out-of-Context-Caption-Detection.
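A minimal sketch of the feature construction, assuming the polarity and subjectivity scores come from TextBlob and the four COSMOS-style image-text features are precomputed elsewhere; the helper names and the MLP hyperparameters are illustrative, not the authors' code.

```python
import numpy as np
from textblob import TextBlob
from sklearn.neural_network import MLPClassifier

def sentiment_features(caption1, caption2):
    """Polarity and subjectivity for both captions of a news-image pair."""
    feats = []
    for text in (caption1, caption2):
        s = TextBlob(text).sentiment
        feats.extend([s.polarity, s.subjectivity])
    return feats

def build_features(image_text_feats, caption_pairs):
    """Concatenate the precomputed image-text features with the sentiment
    features for every sample."""
    sent = np.array([sentiment_features(c1, c2) for c1, c2 in caption_pairs])
    return np.hstack([np.asarray(image_text_feats), sent])

# X = build_features(image_text_feats, caption_pairs); y = 0/1 OOC labels
# clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500).fit(X, y)
```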
Citations: 2
DDAM '22: 1st International Workshop on Deepfake Detection for Audio Multimedia
Pub Date : 2022-10-10 DOI: 10.1145/3503161.3554779
J. Tao, Jiangyan Yi, Cunhang Fan, Ruibo Fu, Shan Liang, Pengyuan Zhang, Haizhou Li, H. Meng, Dong Yu, M. Akagi
Over the last few years, speech synthesis and voice conversion technology has made significant progress with the development of deep learning. The models can generate realistic, human-like speech, and it is difficult for most people to distinguish the generated audio from real audio. However, this technology also poses a great threat to the global political economy and social stability if attackers and criminals misuse it with the intent to cause harm. In this workshop, we aim to bring together researchers from the fields of audio deepfake detection, audio deep synthesis, audio fake games and adversarial attacks to further discuss recent research and future directions for detecting deepfake and manipulated audio in multimedia.
Citations: 0
Cross-modal Semantic Alignment Pre-training for Vision-and-Language Navigation
Pub Date : 2022-10-10 DOI: 10.1145/3503161.3548283
Siying Wu, Xueyang Fu, Feng Wu, Zhengjun Zha
Vision-and-Language Navigation requires an agent to navigate to a target location by progressively grounding and following the relevant instruction, conditioned on its memory and current observation. Existing works utilize cross-modal transformers to pass messages between the visual modality and the textual modality. However, they are still limited in mining the fine-grained matching between the underlying components of trajectories and instructions. Inspired by the significant progress achieved by large-scale pre-training methods, in this paper we propose CSAP, a new method of Cross-modal Semantic Alignment Pre-training for Vision-and-Language Navigation. It is designed to learn alignment from trajectory-instruction pairs through two novel tasks: trajectory-conditioned masked fragment modeling and contrastive semantic-alignment modeling. Specifically, trajectory-conditioned masked fragment modeling encourages the agent to extract useful visual information to reconstruct the masked fragment. Contrastive semantic-alignment modeling is designed to align the visual representation with corresponding phrase embeddings. Experimental results on the benchmark dataset demonstrate that a transformer-based navigation agent pre-trained with our proposed CSAP outperforms existing methods on both SR and SPL scores.
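The contrastive semantic-alignment objective can be sketched as a symmetric InfoNCE loss between visual representations and phrase embeddings; this generic form is an assumption about the loss family rather than the paper's exact formulation, and the masked fragment modeling task is omitted.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(visual_emb, phrase_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched (visual, phrase) pairs:
    each visual embedding should score highest against its own phrase
    embedding, and vice versa."""
    v = F.normalize(visual_emb, dim=-1)          # (batch, dim)
    t = F.normalize(phrase_emb, dim=-1)          # (batch, dim)
    logits = v @ t.t() / temperature             # (batch, batch) similarities
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```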
Citations: 5