Reconstructing a 3D human mesh from a single RGB image is a challenging task due to the inherent depth ambiguity. Researchers commonly use convolutional neural networks to extract features and then apply spatial aggregation on the feature maps to exploit the 3D cues embedded in the 2D image. Recently, two forms of spatial aggregation, transformers and spatial attention, have been adopted to achieve state-of-the-art performance, yet both have limitations. Transformers help model long-range dependencies across different joints, but their grid tokens do not adapt to the positions and shapes of human joints in different images. Spatial attention, in contrast, focuses on joint-specific features, but its concentrated attention maps ignore non-local information about the body. To address these issues, we propose a Learnable Sampling module that generates joint-adaptive tokens, which are then aggregated globally by transformers. Feature vectors are sampled from the feature maps to form the tokens of different joints, with the sampling weights predicted by a learnable network so that the model learns to sample joint-related features adaptively. Because our adaptive tokens are explicitly correlated with human joints, the global dependencies among different joints can be modeled more effectively. To validate the effectiveness of our method, we conduct experiments on several popular datasets, including Human3.6M and 3DPW. Our method achieves lower reconstruction errors than previous state-of-the-art methods in terms of both vertex-based and joint-based metrics. The code and trained models are released at https://github.com/thuxyz19/Learnable-Sampling.
{"title":"3D Human Mesh Reconstruction by Learning to Sample Joint Adaptive Tokens for Transformers","authors":"Youze Xue, Jiansheng Chen, Yudong Zhang, Cheng Yu, Huimin Ma, Hongbing Ma","doi":"10.1145/3503161.3548133","DOIUrl":"https://doi.org/10.1145/3503161.3548133","url":null,"abstract":"Reconstructing 3D human mesh from a single RGB image is a challenging task due to the inherent depth ambiguity. Researchers commonly use convolutional neural networks to extract features and then apply spatial aggregation on the feature maps to explore the embedded 3D cues in the 2D image. Recently, two methods of spatial aggregation, the transformers and the spatial attention, are adopted to achieve the state-of-the-art performance, whereas they both have limitations. The use of transformers helps modelling long-term dependency across different joints whereas the grid tokens are not adaptive for the positions and shapes of human joints in different images. On the contrary, the spatial attention focuses on joint-specific features. However, the non-local information of the body is ignored by the concentrated attention maps. To address these issues, we propose a Learnable Sampling module to generate joint adaptive tokens and then use transformers to aggregate global information. Feature vectors are sampled accordingly from the feature maps to form the tokens of different joints. The sampling weights are predicted by a learnable network so that the model can learn to sample joint-related features adaptively. Our adaptive tokens are explicitly correlated with human joints, so that more effective modeling of global dependency among different human joints can be achieved. To validate the effectiveness of our method, we conduct experiments on several popular datasets including Human3.6M and 3DPW. Our method achieves lower reconstruction errors in terms of both the vertex-based metric and the joint-based metric compared to previous state of the arts. The codes and the trained models are released at https://github.com/thuxyz19/Learnable-Sampling.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"94 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126224761","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Image captioning research is shifting towards a fully end-to-end paradigm that leverages powerful visual pre-trained models and transformer-based generation architectures for more flexible model training and faster inference. State-of-the-art approaches simply extract isolated concepts or attributes to assist description generation. However, such approaches do not consider the hierarchical semantic structure of the textual domain, which leads to an unpredictable mapping between visual representations and concept words. To this end, we propose a novel Progressive Tree-Structured prototype Network (dubbed PTSN), which is the first attempt to narrow down the scope of predicted words to those with appropriate semantics by modeling hierarchical textual semantics. Specifically, we design a novel embedding method, the tree-structured prototype, which produces a set of hierarchical representative embeddings capturing the hierarchical semantic structure of the textual space. To incorporate such tree-structured prototypes into visual cognition, we also propose a progressive aggregation module that exploits semantic relationships within the image and the prototypes. Applying PTSN to an end-to-end captioning framework, extensive experiments on the MSCOCO dataset show that our method achieves new state-of-the-art performance with 144.2% (single model) and 146.5% (ensemble of 4 models) CIDEr scores on the 'Karpathy' split, and 141.4% (c5) and 143.9% (c40) CIDEr scores on the official online test server. Trained models and source code have been released at: https://github.com/NovaMind-Z/PTSN.
{"title":"Progressive Tree-Structured Prototype Network for End-to-End Image Captioning","authors":"Pengpeng Zeng, Jinkuan Zhu, Jingkuan Song, Lianli Gao","doi":"10.1145/3503161.3548024","DOIUrl":"https://doi.org/10.1145/3503161.3548024","url":null,"abstract":"Studies of image captioning are shifting towards a trend of a fully end-to-end paradigm by leveraging powerful visual pre-trained models and transformer-based generation architecture for more flexible model training and faster inference speed. State-of-the-art approaches simply extract isolated concepts or attributes to assist description generation. However, such approaches do not consider the hierarchical semantic structure in the textual domain, which leads to an unpredictable mapping between visual representations and concept words. To this end, we propose a novel Progressive Tree-Structured prototype Network (dubbed PTSN), which is the first attempt to narrow down the scope of prediction words with appropriate semantics by modeling the hierarchical textual semantics. Specifically, we design a novel embedding method called tree-structured prototype, producing a set of hierarchical representative embeddings which capture the hierarchical semantic structure in textual space. To utilize such tree-structured prototypes into visual cognition, we also propose a progressive aggregation module to exploit semantic relationships within the image and prototypes. By applying our PTSN to the end-to-end captioning framework, extensive experiments conducted on MSCOCO dataset show that our method achieves a new state-of-the-art performance with 144.2% (single model) and 146.5% (ensemble of 4 models) CIDEr scores on 'Karpathy' split and 141.4% (c5) and 143.9% (c40) CIDEr scores on the official online test server. Trained models and source code have been released at: https://github.com/NovaMind-Z/PTSN.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126426753","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zitai Wang, Qianqian Xu, Ke Ma, Xiaochun Cao, Qingming Huang
Traditional machine learning implicitly assumes that a single entity (e.g., a person or an organization) can complete all the jobs of the whole learning process: data collection, algorithm design, parameter selection, and model evaluation. However, many practical scenarios require cooperation among entities, and existing paradigms fail to meet requirements such as cost, privacy, or security. In this paper, we consider a generalized paradigm, called Confederated Learning, in which different roles are granted multiple permissions to complete their corresponding jobs. Systematic analysis shows that confederated learning generalizes both traditional machine learning and existing distributed paradigms such as federated learning. We then study an application scenario of confederated learning that could inspire future research on cooperation between different entities. Three methods are proposed as a first attempt at cooperative learning under restricted conditions. Empirical results on three datasets validate the effectiveness of the proposed methods.
{"title":"Confederated Learning: Going Beyond Centralization","authors":"Zitai Wang, Qianqian Xu, Ke Ma, Xiaochun Cao, Qingming Huang","doi":"10.1145/3503161.3548157","DOIUrl":"https://doi.org/10.1145/3503161.3548157","url":null,"abstract":"Traditional machine learning implicitly assumes that a single entity (e.g., a person or an organization) could complete all the jobs of the whole learning process: data collection, algorithm design, parameter selection, and model evaluation. However, many practical scenarios require cooperation among entities, and existing paradigms fail to meet cost, privacy, or security requirements and so on. In this paper, we consider a generalized paradigm: different roles are granted multiple permissions to complete their corresponding jobs, called Confederated Learning. Systematic analysis shows that confederated learning generalizes traditional machine learning and the existing distributed paradigms like federation learning. Then, we study an application scenario of confederated learning which could inspire future research in the context of cooperation between different entities. Three methods are proposed as the first trial for the cooperated learning under restricted conditions. Empirical results on three datasets validate the effectiveness of the proposed methods.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"188 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128059803","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jiwei Guo, Jiajia Tang, Weichen Dai, Yu Ding, Wanzeng Kong
Multimodal Sentiment Analysis is a promising research area for modeling multiple heterogeneous modalities. Two major challenges in this area are that a) multimodal data are inherently unaligned due to the different sampling rates of each modality, and b) long-range dependencies exist between elements across modalities. These challenges increase the difficulty of efficient multimodal fusion. In this work, we propose a novel end-to-end network named Cross Hyper-modality Fusion Network (CHFN). CHFN is an interpretable Transformer-based neural model that provides an efficient framework for fusing unaligned multimodal sequences. The heart of our model is to dynamically adjust word representations in different non-verbal contexts using unaligned multimodal sequences. It captures the influence of non-verbal behavioral information at the scale of the entire utterance and integrates this influence into the verbal expression. We conducted experiments on both of the publicly available multimodal sentiment analysis datasets CMU-MOSI and CMU-MOSEI. The results demonstrate that our model surpasses state-of-the-art models. In addition, we visualize the learned interactions between the language modality and non-verbal behavioral information and explore the underlying dynamics of multimodal language data.
{"title":"Dynamically Adjust Word Representations Using Unaligned Multimodal Information","authors":"Jiwei Guo, Jiajia Tang, Weichen Dai, Yu Ding, Wanzeng Kong","doi":"10.1145/3503161.3548137","DOIUrl":"https://doi.org/10.1145/3503161.3548137","url":null,"abstract":"Multimodal Sentiment Analysis is a promising research area for modeling multiple heterogeneous modalities. Two major challenges that exist in this area are a) multimodal data is unaligned in nature due to the different sampling rates of each modality, and b) long-range dependencies between elements across modalities. These challenges increase the difficulty of conducting efficient multimodal fusion. In this work, we propose a novel end-to-end network named Cross Hyper-modality Fusion Network (CHFN). The CHFN is an interpretable Transformer-based neural model that provides an efficient framework for fusing unaligned multimodal sequences. The heart of our model is to dynamically adjust word representations in different non-verbal contexts using unaligned multimodal sequences. It is concerned with the influence of non-verbal behavioral information at the scale of the entire utterances and then integrates this influence into verbal expression. We conducted experiments on both publicly available multimodal sentiment analysis datasets CMU-MOSI and CMU-MOSEI. The experiment results demonstrate that our model surpasses state-of-the-art models. In addition, we visualize the learned interactions between language modality and non-verbal behavior information and explore the underlying dynamics of multimodal language data.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128137624","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sign Language Production (SLP) aims to generate the visual appearance of sign language from spoken language, and a key step in this process is translating sign Gloss to Pose (G2P). Existing G2P methods mainly focus on regression of posture coordinates, i.e., closely fitting the ground truth. In this paper, we take a new viewpoint and propose a Gloss semantic-Enhanced Network with Online Back-Translation (GEN-OBT) for G2P in the SLP task. Specifically, GEN-OBT consists of a gloss encoder, a pose decoder, and an online reverse gloss decoder. In the transformer-based gloss encoder, we design a learnable gloss token, requiring no prior knowledge of gloss, to explore the global contextual dependency of the entire gloss sequence. During sign pose generation, the gloss token is aggregated onto the poses generated so far to provide gloss guidance. The aggregated features then interact with the full set of gloss embedding vectors to generate the next pose. Furthermore, we design a CTC-based reverse decoder that converts the generated poses back into glosses, which enforces semantic consistency between the gloss-to-pose and pose-to-gloss processes. Extensive experiments on the challenging PHOENIX14T benchmark demonstrate that the proposed GEN-OBT outperforms state-of-the-art models. Visualization results further validate the interpretability of our method.
{"title":"Gloss Semantic-Enhanced Network with Online Back-Translation for Sign Language Production","authors":"Shengeng Tang, Richang Hong, Dan Guo, Meng Wang","doi":"10.1145/3503161.3547830","DOIUrl":"https://doi.org/10.1145/3503161.3547830","url":null,"abstract":"Sign Language Production (SLP) aims to generate the visual appearance of sign language according to the spoken language, in which a key procedure is to translate sign Gloss to Pose (G2P). Existing G2P methods mainly focus on regression prediction of posture coordinates, namely closely fitting the ground truth. In this paper, we provide a new viewpoint: a Gloss semantic-Enhanced Network is proposed with Online Back-Translation (GEN-OBT) for G2P in the SLP task. Specifically, GEN-OBT consists of a gloss encoder, a pose decoder, and an online reverse gloss decoder. In the gloss encoder based on the transformer, we design a learnable gloss token without any prior knowledge of gloss, to explore the global contextual dependency of the entire gloss sequence. During sign pose generation, the gloss token is aggregated onto the existing generated poses as gloss guidance. Then, the aggregated features are interacted with the entire gloss embedding vectors to generate the next pose. Furthermore, we design a CTC-based reverse decoder to convert the generated poses backward into glosses, which guarantees the semantic consistency during the processes of gloss-to-pose and pose-to-gloss. Extensive experiments on the challenging PHOENIX14T benchmark demonstrate that the proposed GEN-OBT outperforms the state-of-the-art models. Visualization results further validate the interpretability of our method.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115842509","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Can artificial intelligence create textures with artistic value under the control of human language? Existing texture synthesis methods require an example texture as input. However, in many practical situations users do not have a satisfactory texture and instead convey their needs to designers through simple sketches and verbal descriptions. This paper proposes a novel texture synthesis framework based on CLIP, which models texture synthesis as an optimization process and realizes text-driven synthesis by minimizing the distance between the input image and the text prompt in the latent space. Our method performs zero-shot image manipulation successfully, even between unseen domains. We implement texture synthesis using two different optimization methods, TextureNet and Diffvg, demonstrating the generality of CLIPTexture. Extensive experiments confirm the robust and superior manipulation performance of our method compared to existing baselines.
{"title":"CLIPTexture: Text-Driven Texture Synthesis","authors":"Yiren Song","doi":"10.1145/3503161.3548146","DOIUrl":"https://doi.org/10.1145/3503161.3548146","url":null,"abstract":"Can artificial intelligence create textures with artistic value according to human language control? Existing texture synthesis methods require example texture input. However, in many practical situations, users don't have satisfying textures but tell designers about their needs through simple sketches and verbal descriptions. This paper proposes a novel texture synthesis framework based on the CLIP, which models the texture synthesis problem as an optimization process and realizes text-driven texture synthesis by minimizing the distance between the input image and the text prompt in latent space. Our method performs zero-shot image manipulation successfully even between unseen domains. We implement texture synthesis using two different optimization methods, the TextureNet and Diffvg, demonstrating the generality of CLIPTexture. Extensive experiments confirmed the robust and superior manipulation performance of our methods compared to the existing baselines.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115868290","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
H. Cao, Xin Li, Jiefeng Ma, Deqiang Jiang, Antai Guo, Yiqing Hu, Hao Liu, Yinsong Liu, Bo Ren
This paper focuses on Document Information Extraction (DIE) in the wild, a problem that has rarely been explored before. In contrast to existing studies, which are mainly tailored to documents with known templates, predefined layouts and keys, and ideal input free of OCR errors, we aim to build a more practical DIE paradigm for real-world scenarios in which input document images may contain unknown layouts and keys and come with problematic OCR results. To achieve this goal, we propose a novel architecture, termed Query-driven Generative Network (QGN), which is equipped with two consecutive modules, i.e., a Layout Context-aware Module (LCM) and a Structured Generation Module (SGM). Given a document image with unseen layouts and fields, the LCM yields value prefix candidates that serve as query prompts for the SGM to generate the final key-value pairs, even in the presence of OCR noise. To further investigate the potential of our method, we create a new large-scale dataset, named LArge-scale STructured Documents (LastDoc4000), containing 4,000 documents with 1,511 layouts and 3,500 different keys. In experiments, we demonstrate that QGN consistently achieves the best F1-score on the new LastDoc4000 dataset, with up to 30.32% absolute improvement. A more comprehensive experimental analysis and experiments on other public benchmarks further verify the effectiveness and robustness of our method for the wild DIE task.
{"title":"Query-driven Generative Network for Document Information Extraction in the Wild","authors":"H. Cao, Xin Li, Jiefeng Ma, Deqiang Jiang, Antai Guo, Yiqing Hu, Hao Liu, Yinsong Liu, Bo Ren","doi":"10.1145/3503161.3547877","DOIUrl":"https://doi.org/10.1145/3503161.3547877","url":null,"abstract":"This paper focuses on solving Document Information Extraction (DIE) in the wild problem, which is rarely explored before. In contrast to existing studies mainly tailored for document cases in known templates with predefined layouts and keys under the ideal input without OCR errors involved, we aim to build up a more practical DIE paradigm for real-world scenarios where input document images may contain unknown layouts and keys in the scenes of the problematic OCR results. To achieve this goal, we propose a novel architecture, termed Query-driven Generative Network (QGN), which is equipped with two consecutive modules, i.e., Layout Context-aware Module (LCM) and Structured Generation Module (SGM). Given a document image with unseen layouts and fields, the former LCM yields the value prefix candidates serving as the query prompts for the SGM to generate the final key-value pairs even with OCR noise. To further investigate the potential of our method, we create a new large-scale dataset, named LArge-scale STructured Documents (LastDoc4000), containing 4,000 documents with 1,511 layouts and 3,500 different keys. In experiments, we demonstrate that our QGN consistently achieves the best F1-score on the new LastDoc4000 dataset by at most 30.32% absolute improvement. A more comprehensive experimental analysis and experiments on other public benchmarks also verify the effectiveness and robustness of our proposed method for the wild DIE task.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132209621","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Wei Gao, Ge Li, Hui Yuan, R. Hamzaoui, Zhu Li, Shan Liu
Point clouds are attracting much attention from academia, industry, and standardization organizations such as MPEG, JPEG, and AVS. 3D point clouds consisting of thousands or even millions of points with attributes can represent real-world objects and scenes in a way that enables an improved immersive visual experience and facilitates complex 3D vision tasks. In addition to various point cloud analysis and processing tasks (e.g., segmentation, classification, 3D object detection, registration), efficient compression of these large-scale 3D visual data is essential to make point cloud applications more effective. This workshop focuses on point cloud processing, analysis, and compression in challenging situations to further improve visual experience and machine vision performance. Both learning-based and non-learning-based perception-oriented optimization algorithms for compression and processing are solicited. Contributions that advance the state of the art in analysis tasks are also welcome.
{"title":"APCCPA '22: 1st International Workshop on Advances in Point Cloud Compression, Processing and Analysis","authors":"Wei Gao, Ge Li, Hui Yuan, R. Hamzaoui, Zhu Li, Shan Liu","doi":"10.1145/3503161.3554780","DOIUrl":"https://doi.org/10.1145/3503161.3554780","url":null,"abstract":"Point clouds are attracting much attention from academia, industry and standardization organizations such as MPEG, JPEG, and AVS. 3D Point clouds consisting of thousands or even millions of points with attributes can represent real-world objects and scenes in a way that enables an improved immersive visual experience and facilitates complex 3D vision tasks. In addition to various point cloud analysis and processing tasks (e.g., segmentation, classification, 3D object detection, registration), efficient compression for these large-scale 3D visual data is essential to make point cloud applications more effective. This workshop focuses on point cloud processing, analy sis, and compression in challenging situations to further improve visual experience and machine vision performance. Both learning-based and non-learning-based perception-oriented optimization algorithms for compression and processing are solicited. Contributions that advance the state-of-the-art in analysis tasks, are also welcomed.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132222383","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sofia Hinckel Dias, Sara Rodrigues Silva, Beatriz Rodrigues Silva, Rui Nóbrega
Interactive storytelling enables watchers to change the story through an exploratory navigation style. We propose to showcase a collaborative screen that investigates the process of crafting an interactive storytelling animation through the metaphors that built it: drawing from a pre-established database, the watcher can help create different outputs (e.g., changing the sound, color, camera movement, gloss and surface, background, and characters). The result is an F-curve graph (time versus animated position) that clusters a new layer of added semantic information about the reshaped story.
{"title":"Collaboration Superpowers: The Process of Crafting an Interactive Storytelling Animation","authors":"Sofia Hinckel Dias, Sara Rodrigues Silva, Beatriz Rodrigues Silva, Rui Nóbrega","doi":"10.1145/3503161.3549963","DOIUrl":"https://doi.org/10.1145/3503161.3549963","url":null,"abstract":"Interactive storytelling enables watchers to change the story through an exploratory navigation style. We propose to showcase a collaborative screen to investigate the process of crafting an interactive storytelling animation through the metaphors that built it - from a pre-established database, the watcher can help to create different outputs (e.g. changing sound, color, camera movement, gloss and surface, background and characters). The result is an F-curve graph, time versus animated position, clustering a new layer of added semantic information about the reshaped story.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130152396","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
View change poses a significant challenge to action representation and recognition due to pose occlusion and deformation. We propose a Global-Local Cross-View Fisher Discrimination (GL-CVFD) algorithm to tackle this problem. In the GL-CVFD approach, we first capture the motion trajectories of body joints in action sequences as the feature input, which weakens the effect of view change. Second, we design a Global-Local Cross-View Representation (CVR) learning module, which builds global-level and local-level graphs to link body parts and joints across different views. It enhances cross-view information interaction and obtains an effective view-common action representation. Third, we present a Cross-View Fisher Discrimination (CVFD) module, which performs a view-differential operation to separate view-specific action features and modifies the Fisher discriminator to implement view-semantic Fisher contrastive learning. It pulls and pushes view-specific and view-common action features in the view term to guarantee the validity of the CVR module, and then distinguishes view-common action features in the semantic term for view-invariant recognition. Extensive and fair evaluations are conducted on the UESTC, NTU 60, and NTU 120 datasets. Experimental results show that our proposed approach achieves encouraging performance in skeleton-based view-invariant action recognition.
{"title":"Global-Local Cross-View Fisher Discrimination for View-Invariant Action Recognition","authors":"Lingling Gao, Yanli Ji, Yang Yang, Heng Tao Shen","doi":"10.1145/3503161.3548280","DOIUrl":"https://doi.org/10.1145/3503161.3548280","url":null,"abstract":"View change brings a significant challenge to action representation and recognition due to pose occlusion and deformation. We propose a Global-Local Cross-View Fisher Discrimination (GL-CVFD) algorithm to tackle this problem. In the GL-CVFD approach, we firstly capture the motion trajectory of body joints in action sequences as feature input to weaken the effect of view change. Secondly, we design a Global-Local Cross-View Representation (CVR) learning module, which builds global-level and local-level graphs to link body parts and joints between different views. It can enhance the cross-view information interaction and obtain an effective view-common action representation. Thirdly, we present a Cross-View Fisher Discrimination (CVFD) module, which performs a view-differential operation to separate view-specific action features and modifies the Fisher discriminator to implement view-semantic Fisher contrastive learning. It operates by pulling and pushing on view-specific and view-common action features in the view term to guarantee the validity of the CVR module, then distinguishes view-common action features in the semantic term for view-invariant recognition. Extensive and fair evaluations are implemented in the UESTC, NTU 60, and NTU 120 datasets. Experiment results illustrate that our proposed approach achieves encouraging performance in skeleton-based view-invariant action recognition.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134195804","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}