
Latest publications from the Proceedings of the 30th ACM International Conference on Multimedia

3D Human Mesh Reconstruction by Learning to Sample Joint Adaptive Tokens for Transformers
Pub Date: 2022-10-10 | DOI: 10.1145/3503161.3548133
Youze Xue, Jiansheng Chen, Yudong Zhang, Cheng Yu, Huimin Ma, Hongbing Ma
Reconstructing a 3D human mesh from a single RGB image is a challenging task due to the inherent depth ambiguity. Researchers commonly use convolutional neural networks to extract features and then apply spatial aggregation on the feature maps to explore the embedded 3D cues in the 2D image. Recently, two methods of spatial aggregation, transformers and spatial attention, have been adopted to achieve state-of-the-art performance, but both have limitations. Transformers help model long-range dependencies across different joints, yet their grid tokens are not adaptive to the positions and shapes of human joints in different images. In contrast, spatial attention focuses on joint-specific features, but the non-local information of the body is ignored by the concentrated attention maps. To address these issues, we propose a Learnable Sampling module to generate joint-adaptive tokens and then use transformers to aggregate global information. Feature vectors are sampled from the feature maps to form the tokens of different joints. The sampling weights are predicted by a learnable network so that the model can learn to sample joint-related features adaptively. Our adaptive tokens are explicitly correlated with human joints, so more effective modeling of the global dependency among different human joints can be achieved. To validate the effectiveness of our method, we conduct experiments on several popular datasets including Human3.6M and 3DPW. Our method achieves lower reconstruction errors on both the vertex-based and joint-based metrics compared to previous state-of-the-art methods. The code and trained models are released at https://github.com/thuxyz19/Learnable-Sampling.
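As a rough illustration of the sampling mechanism described above, the sketch below predicts one spatial weight map per joint from the CNN feature map, forms joint tokens as weighted sums of feature vectors, and aggregates them with a transformer encoder. The layer choices, dimensions, and module names are assumptions for illustration, not the released implementation at the linked repository.

```python
# Minimal PyTorch sketch of joint-adaptive token sampling (assumed layout).
import torch
import torch.nn as nn


class LearnableSamplingSketch(nn.Module):
    def __init__(self, feat_dim=256, num_joints=24, num_layers=4, num_heads=8):
        super().__init__()
        # Predict one sampling logit per joint at every spatial location.
        self.weight_head = nn.Conv2d(feat_dim, num_joints, kernel_size=1)
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=num_heads,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers)

    def forward(self, feat_map):                      # feat_map: (B, C, H, W)
        weights = self.weight_head(feat_map)          # (B, J, H, W)
        weights = weights.flatten(2).softmax(dim=-1)  # (B, J, H*W), one map per joint
        feats = feat_map.flatten(2).transpose(1, 2)   # (B, H*W, C)
        tokens = torch.bmm(weights, feats)            # (B, J, C) joint-adaptive tokens
        return self.transformer(tokens)               # global aggregation across joints


tokens = LearnableSamplingSketch()(torch.randn(2, 256, 56, 56))
print(tokens.shape)  # torch.Size([2, 24, 256])
```

The key property is that the weight maps are learned end to end, so the tokens can follow the joints rather than a fixed grid.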
Citations: 2
Progressive Tree-Structured Prototype Network for End-to-End Image Captioning
Pub Date: 2022-10-10 | DOI: 10.1145/3503161.3548024
Pengpeng Zeng, Jinkuan Zhu, Jingkuan Song, Lianli Gao
Studies of image captioning are shifting towards a fully end-to-end paradigm by leveraging powerful visual pre-trained models and transformer-based generation architectures for more flexible model training and faster inference. State-of-the-art approaches simply extract isolated concepts or attributes to assist description generation. However, such approaches do not consider the hierarchical semantic structure in the textual domain, which leads to an unpredictable mapping between visual representations and concept words. To this end, we propose a novel Progressive Tree-Structured prototype Network (dubbed PTSN), which is the first attempt to narrow down the scope of prediction words with appropriate semantics by modeling hierarchical textual semantics. Specifically, we design a novel embedding method called the tree-structured prototype, producing a set of hierarchical representative embeddings that capture the hierarchical semantic structure of the textual space. To bring such tree-structured prototypes into visual cognition, we also propose a progressive aggregation module that exploits semantic relationships between the image and the prototypes. By applying our PTSN to the end-to-end captioning framework, extensive experiments conducted on the MSCOCO dataset show that our method achieves new state-of-the-art performance with 144.2% (single model) and 146.5% (ensemble of 4 models) CIDEr scores on the 'Karpathy' split and 141.4% (c5) and 143.9% (c40) CIDEr scores on the official online test server. Trained models and source code have been released at: https://github.com/NovaMind-Z/PTSN.
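A hedged sketch of the prototype idea: learnable prototype tables at a few semantic levels, grid features softly assigned to each level by cosine similarity, and the per-level readouts aggregated back onto the grid. The level sizes and the residual aggregation are illustrative assumptions rather than the released PTSN code.

```python
# Toy sketch of hierarchical (tree-structured) prototypes with soft assignment.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TreePrototypeSketch(nn.Module):
    def __init__(self, dim=512, level_sizes=(8, 32, 128)):
        super().__init__()
        # One learnable prototype table per level, coarse to fine.
        self.levels = nn.ParameterList(
            [nn.Parameter(torch.randn(n, dim) * 0.02) for n in level_sizes])

    def forward(self, grid_feats):                    # (B, N, dim) visual tokens
        readouts = []
        for protos in self.levels:
            sim = F.normalize(grid_feats, dim=-1) @ F.normalize(protos, dim=-1).t()
            assign = sim.softmax(dim=-1)              # (B, N, K) soft assignment
            readouts.append(assign @ protos)          # prototype-enhanced features
        # Progressive aggregation: fold every level back onto the grid features.
        return grid_feats + sum(readouts)


out = TreePrototypeSketch()(torch.randn(2, 49, 512))
print(out.shape)  # torch.Size([2, 49, 512])
```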
Citations: 9
Confederated Learning: Going Beyond Centralization
Pub Date: 2022-10-10 | DOI: 10.1145/3503161.3548157
Zitai Wang, Qianqian Xu, Ke Ma, Xiaochun Cao, Qingming Huang
Traditional machine learning implicitly assumes that a single entity (e.g., a person or an organization) can complete all the jobs of the whole learning process: data collection, algorithm design, parameter selection, and model evaluation. However, many practical scenarios require cooperation among entities, and existing paradigms fail to meet requirements such as cost, privacy, and security. In this paper, we consider a generalized paradigm, called Confederated Learning, in which different roles are granted multiple permissions to complete their corresponding jobs. Systematic analysis shows that confederated learning generalizes traditional machine learning as well as existing distributed paradigms such as federated learning. Then, we study an application scenario of confederated learning that could inspire future research on cooperation between different entities. Three methods are proposed as a first attempt at cooperative learning under restricted conditions. Empirical results on three datasets validate the effectiveness of the proposed methods.
Citations: 1
Dynamically Adjust Word Representations Using Unaligned Multimodal Information
Pub Date: 2022-10-10 | DOI: 10.1145/3503161.3548137
Jiwei Guo, Jiajia Tang, Weichen Dai, Yu Ding, Wanzeng Kong
Multimodal Sentiment Analysis is a promising research area for modeling multiple heterogeneous modalities. Two major challenges in this area are that a) multimodal data is inherently unaligned due to the different sampling rates of the modalities, and b) long-range dependencies exist between elements across modalities. These challenges increase the difficulty of conducting efficient multimodal fusion. In this work, we propose a novel end-to-end network named the Cross Hyper-modality Fusion Network (CHFN). The CHFN is an interpretable Transformer-based neural model that provides an efficient framework for fusing unaligned multimodal sequences. The heart of our model is to dynamically adjust word representations in different non-verbal contexts using unaligned multimodal sequences. It captures the influence of non-verbal behavioral information at the scale of entire utterances and integrates this influence into the verbal expression. We conducted experiments on the publicly available multimodal sentiment analysis datasets CMU-MOSI and CMU-MOSEI. The experimental results demonstrate that our model surpasses state-of-the-art models. In addition, we visualize the learned interactions between language modality and non-verbal behavior information and explore the underlying dynamics of multimodal language data.
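The word-adjustment step can be pictured as cross-attention from word tokens to the unaligned audio and visual frames, followed by a gated residual update. The gating scheme below is an assumption for illustration and not necessarily the exact CHFN design.

```python
# Sketch: words query unaligned non-verbal streams via cross-attention,
# and a learned gate decides how strongly each word representation shifts.
import torch
import torch.nn as nn


class WordAdjustSketch(nn.Module):
    def __init__(self, dim=128, num_heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, words, audio, video):
        context = torch.cat([audio, video], dim=1)        # unaligned frames, any length
        shift, _ = self.cross_attn(words, context, context)
        g = self.gate(torch.cat([words, shift], dim=-1))  # per-word adjustment strength
        return words + g * shift


out = WordAdjustSketch()(torch.randn(2, 20, 128),   # 20 words
                         torch.randn(2, 300, 128),  # 300 audio frames
                         torch.randn(2, 75, 128))   # 75 video frames
print(out.shape)  # torch.Size([2, 20, 128])
```

Because the non-verbal frames only ever act as keys and values, the two streams never need to be aligned or resampled to the word rate.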
Citations: 8
Gloss Semantic-Enhanced Network with Online Back-Translation for Sign Language Production
Pub Date: 2022-10-10 | DOI: 10.1145/3503161.3547830
Shengeng Tang, Richang Hong, Dan Guo, Meng Wang
Sign Language Production (SLP) aims to generate the visual appearance of sign language according to the spoken language, in which a key procedure is translating sign Gloss to Pose (G2P). Existing G2P methods mainly focus on regression of posture coordinates, i.e., closely fitting the ground truth. In this paper, we provide a new viewpoint: a Gloss semantic-Enhanced Network with Online Back-Translation (GEN-OBT) is proposed for G2P in the SLP task. Specifically, GEN-OBT consists of a gloss encoder, a pose decoder, and an online reverse gloss decoder. In the transformer-based gloss encoder, we design a learnable gloss token, without any prior knowledge of gloss, to explore the global contextual dependency of the entire gloss sequence. During sign pose generation, the gloss token is aggregated onto the previously generated poses as gloss guidance. The aggregated features then interact with the entire gloss embedding vectors to generate the next pose. Furthermore, we design a CTC-based reverse decoder to convert the generated poses back into glosses, which guarantees semantic consistency during the gloss-to-pose and pose-to-gloss processes. Extensive experiments on the challenging PHOENIX14T benchmark demonstrate that the proposed GEN-OBT outperforms state-of-the-art models. Visualization results further validate the interpretability of our method.
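The online back-translation component can be approximated as a pose-to-gloss recognizer trained with a CTC loss, so that generated poses must remain decodable into the input glosses. The GRU-plus-linear recognizer below is an assumption; only the CTC formulation itself follows the description.

```python
# Sketch of a CTC back-translation term: generated poses -> gloss logits -> CTC loss.
import torch
import torch.nn as nn

vocab_size, blank_id = 1000, 0
pose_dim, hidden = 150, 256

encoder = nn.GRU(pose_dim, hidden, batch_first=True)   # assumed reverse decoder
classifier = nn.Linear(hidden, vocab_size)
ctc = nn.CTCLoss(blank=blank_id, zero_infinity=True)

poses = torch.randn(2, 120, pose_dim)           # generated pose sequences (B, T, D)
glosses = torch.randint(1, vocab_size, (2, 8))  # input gloss ids (B, U), no blanks
h, _ = encoder(poses)
log_probs = classifier(h).log_softmax(-1)       # (B, T, V)

# CTCLoss expects (T, B, V) log-probs plus per-sample lengths.
loss = ctc(log_probs.transpose(0, 1),
           glosses,
           input_lengths=torch.full((2,), 120, dtype=torch.long),
           target_lengths=torch.full((2,), 8, dtype=torch.long))
print(loss.item())
```

In training, this loss would be added to the pose regression objective so that the generator is penalized whenever its poses stop carrying the gloss semantics.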
Citations: 9
CLIPTexture: Text-Driven Texture Synthesis
Pub Date: 2022-10-10 | DOI: 10.1145/3503161.3548146
Yiren Song
Can artificial intelligence create textures with artistic value under human language control? Existing texture synthesis methods require an example texture as input. However, in many practical situations, users do not have a satisfactory texture at hand and instead convey their needs to designers through simple sketches and verbal descriptions. This paper proposes a novel texture synthesis framework based on CLIP, which models texture synthesis as an optimization process and realizes text-driven synthesis by minimizing the distance between the input image and the text prompt in latent space. Our method performs zero-shot image manipulation successfully even across unseen domains. We implement texture synthesis using two different optimization methods, TextureNet and Diffvg, demonstrating the generality of CLIPTexture. Extensive experiments confirm the robust and superior manipulation performance of our method compared to existing baselines.
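The core optimization loop is simple to sketch: keep a texture as the free variable and minimize the CLIP-space cosine distance to the prompt. The direct pixel parameterization below is a simplification of the paper's TextureNet and Diffvg variants, and the prompt, step count, and learning rate are arbitrary; it assumes the OpenAI clip package is installed.

```python
# CLIP-guided texture optimization sketch (pixel parameterization, CPU/fp32).
import torch
import torch.nn.functional as F
import clip  # https://github.com/openai/CLIP

device = "cpu"  # keep everything fp32; clip.load uses fp16 weights on GPU
model, _ = clip.load("ViT-B/32", device=device, jit=False)
text = clip.tokenize(["a seamless texture of green moss"]).to(device)

texture = torch.rand(1, 3, 256, 256, device=device, requires_grad=True)
optimizer = torch.optim.Adam([texture], lr=0.05)

# CLIP's input normalization constants.
mean = torch.tensor([0.48145466, 0.4578275, 0.40821073]).view(1, 3, 1, 1)
std = torch.tensor([0.26862954, 0.26130258, 0.27577711]).view(1, 3, 1, 1)

with torch.no_grad():
    text_feat = F.normalize(model.encode_text(text), dim=-1)

for step in range(200):
    img = F.interpolate(texture.clamp(0, 1), size=224, mode="bilinear")
    img_feat = F.normalize(model.encode_image((img - mean) / std), dim=-1)
    loss = 1.0 - (img_feat * text_feat).sum()   # cosine distance in latent space
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Swapping the pixel tensor for a generator network or a differentiable vector-graphics renderer changes only what the optimizer updates; the CLIP loss stays the same.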
Citations: 2
Query-driven Generative Network for Document Information Extraction in the Wild
Pub Date: 2022-10-10 | DOI: 10.1145/3503161.3547877
H. Cao, Xin Li, Jiefeng Ma, Deqiang Jiang, Antai Guo, Yiqing Hu, Hao Liu, Yinsong Liu, Bo Ren
This paper focuses on the problem of Document Information Extraction (DIE) in the wild, which has rarely been explored before. In contrast to existing studies, which are mainly tailored to documents from known templates with predefined layouts and keys under ideal input free of OCR errors, we aim to build a more practical DIE paradigm for real-world scenarios where input document images may contain unknown layouts and keys and come with problematic OCR results. To achieve this goal, we propose a novel architecture, termed the Query-driven Generative Network (QGN), which is equipped with two consecutive modules: a Layout Context-aware Module (LCM) and a Structured Generation Module (SGM). Given a document image with unseen layouts and fields, the LCM yields value prefix candidates that serve as query prompts for the SGM to generate the final key-value pairs even under OCR noise. To further investigate the potential of our method, we create a new large-scale dataset, named LArge-scale STructured Documents (LastDoc4000), containing 4,000 documents with 1,511 layouts and 3,500 different keys. In experiments, we demonstrate that our QGN consistently achieves the best F1-score on the new LastDoc4000 dataset, with up to 30.32% absolute improvement. A more comprehensive experimental analysis and experiments on other public benchmarks also verify the effectiveness and robustness of our proposed method for DIE in the wild.
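One way to picture the query-driven generation step is as prefix-conditioned decoding: the layout module proposes a value prefix, and a generator continues from that prefix to emit the rest of the key-value pair. The tiny untrained GRU decoder below is purely illustrative; the vocabulary, sizes, and greedy decoding are assumptions and not QGN's actual generator.

```python
# Toy sketch of "prefix as query prompt": seed a decoder with a value-prefix
# candidate and continue decoding from it (untrained, so output is arbitrary;
# shown only for the control flow).
import torch
import torch.nn as nn

vocab, dim = 500, 64
embed = nn.Embedding(vocab, dim)
gru = nn.GRU(dim, dim, batch_first=True)
head = nn.Linear(dim, vocab)

def generate_from_prefix(prefix_ids, max_new_tokens=10):
    """Greedy continuation of a prefix candidate (1D LongTensor of token ids)."""
    tokens = prefix_ids.unsqueeze(0)              # (1, P)
    out, hidden = gru(embed(tokens))              # consume the prefix
    last = tokens[:, -1:]
    generated = []
    for _ in range(max_new_tokens):
        out, hidden = gru(embed(last), hidden)
        last = head(out[:, -1]).argmax(-1, keepdim=True)
        generated.append(last.item())
    return generated

print(generate_from_prefix(torch.tensor([42, 7, 99])))  # e.g. a prefix from the LCM
```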
Citations: 7
APCCPA '22: 1st International Workshop on Advances in Point Cloud Compression, Processing and Analysis
Pub Date: 2022-10-10 | DOI: 10.1145/3503161.3554780
Wei Gao, Ge Li, Hui Yuan, R. Hamzaoui, Zhu Li, Shan Liu
Point clouds are attracting much attention from academia, industry, and standardization organizations such as MPEG, JPEG, and AVS. 3D point clouds consisting of thousands or even millions of points with attributes can represent real-world objects and scenes in a way that enables an improved immersive visual experience and facilitates complex 3D vision tasks. In addition to various point cloud analysis and processing tasks (e.g., segmentation, classification, 3D object detection, registration), efficient compression of these large-scale 3D visual data is essential to make point cloud applications more effective. This workshop focuses on point cloud processing, analysis, and compression in challenging situations to further improve visual experience and machine vision performance. Both learning-based and non-learning-based perception-oriented optimization algorithms for compression and processing are solicited. Contributions that advance the state of the art in analysis tasks are also welcome.
Citations: 0
Collaboration Superpowers: The Process of Crafting an Interactive Storytelling Animation
Pub Date: 2022-10-10 | DOI: 10.1145/3503161.3549963
Sofia Hinckel Dias, Sara Rodrigues Silva, Beatriz Rodrigues Silva, Rui Nóbrega
Interactive storytelling enables watchers to change the story through an exploratory navigation style. We propose to showcase a collaborative screen that investigates the process of crafting an interactive storytelling animation through the metaphors that built it: starting from a pre-established database, the watcher can help create different outputs (e.g., changing sound, color, camera movement, gloss and surface, background, and characters). The result is an F-curve graph (time versus animated position) that clusters a new layer of added semantic information about the reshaped story.
Citations: 0
Global-Local Cross-View Fisher Discrimination for View-Invariant Action Recognition
Pub Date: 2022-10-10 | DOI: 10.1145/3503161.3548280
Lingling Gao, Yanli Ji, Yang Yang, Heng Tao Shen
View change brings a significant challenge to action representation and recognition due to pose occlusion and deformation. We propose a Global-Local Cross-View Fisher Discrimination (GL-CVFD) algorithm to tackle this problem. In the GL-CVFD approach, we first capture the motion trajectories of body joints in action sequences as feature input to weaken the effect of view change. Second, we design a Global-Local Cross-View Representation (CVR) learning module, which builds global-level and local-level graphs to link body parts and joints between different views. This enhances cross-view information interaction and yields an effective view-common action representation. Third, we present a Cross-View Fisher Discrimination (CVFD) module, which performs a view-differential operation to separate view-specific action features and modifies the Fisher discriminator to implement view-semantic Fisher contrastive learning. It pulls and pushes view-specific and view-common action features in the view term to guarantee the validity of the CVR module, and then distinguishes view-common action features in the semantic term for view-invariant recognition. Extensive and fair evaluations are conducted on the UESTC, NTU 60, and NTU 120 datasets. Experimental results show that our approach achieves encouraging performance in skeleton-based view-invariant action recognition.
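Fisher-style discrimination can be written as a scatter-ratio loss: features with the same label are pulled toward their class mean while class means are pushed apart. The function below is one standard way to express that objective and does not reproduce GL-CVFD's specific view/semantic split.

```python
# Generic Fisher-ratio loss: small when classes are tight and far apart.
import torch

def fisher_ratio_loss(features, labels, eps=1e-6):
    """features: (N, D), labels: (N,) integer action (or view) labels."""
    global_mean = features.mean(dim=0)
    within, between = 0.0, 0.0
    for c in labels.unique():
        cls_feats = features[labels == c]
        cls_mean = cls_feats.mean(dim=0)
        within = within + ((cls_feats - cls_mean) ** 2).sum()          # pull together
        between = between + cls_feats.shape[0] * ((cls_mean - global_mean) ** 2).sum()  # push apart
    return within / (between + eps)

feats = torch.randn(32, 128, requires_grad=True)
labels = torch.randint(0, 5, (32,))
loss = fisher_ratio_loss(feats, labels)
loss.backward()
print(loss.item())
```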
Citations: 5