Federated learning has received increasing attention for its ability to enable collaborative learning without leaking privacy. Promising advances have been achieved under the assumption that participants share the same model structure. However, when participants independently customize their models, the resulting architectures cannot communicate directly, which leads to the model heterogeneity problem. Moreover, in real scenarios, the data held by each participant is often limited, so local models trained only on private data perform poorly. Consequently, this paper studies a new and challenging problem, few-shot model agnostic federated learning, where local participants design independent models from their limited private datasets. Considering the scarcity of private data, we propose to leverage abundant publicly available datasets to bridge the gap between local private participants. However, their usage also introduces two problems: inconsistent labels and a large domain gap between the public and private datasets. To address these issues, this paper presents a novel framework with two main parts: 1) model agnostic federated learning, which performs public-private communication by unifying the model prediction outputs on the shared public datasets; and 2) latent embedding adaptation, which addresses the domain gap with an adversarial learning scheme that discriminates between the public and private domains. Together with a theoretical generalization bound analysis, comprehensive experiments under various settings verify our advantage over existing methods and provide a simple yet effective baseline for future work. The code is available at https://github.com/WenkeHuang/FSMAFL.
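As a concrete illustration of the public-private communication idea described above, the following minimal sketch (all names are hypothetical and this is not the released FSMAFL code) aligns heterogeneous client models by distilling each one toward the averaged consensus prediction on an unlabeled public dataset:

```python
# Sketch only: assumes each client model outputs logits over the same public
# label space, even though their architectures differ.
import torch
import torch.nn.functional as F

def communication_round(clients, public_loader, optimizers, temperature=1.0):
    """clients: list of nn.Module instances with possibly different architectures."""
    for x_pub, _ in public_loader:                      # public labels are not needed
        with torch.no_grad():
            # consensus soft prediction averaged over all participants
            consensus = torch.stack(
                [F.softmax(c(x_pub) / temperature, dim=1) for c in clients]
            ).mean(dim=0)
        for client, opt in zip(clients, optimizers):
            logits = client(x_pub)
            loss = F.kl_div(F.log_softmax(logits / temperature, dim=1),
                            consensus, reduction="batchmean")
            opt.zero_grad()
            loss.backward()
            opt.step()
```

The adversarial latent embedding adaptation would add a domain discriminator on top of this loop; it is omitted here for brevity.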
{"title":"Few-Shot Model Agnostic Federated Learning","authors":"Wenke Huang, Mang Ye, Bo Du, Xiand Gao","doi":"10.1145/3503161.3548764","DOIUrl":"https://doi.org/10.1145/3503161.3548764","url":null,"abstract":"Federated learning has received increasing attention for its ability to collaborative learning without leaking privacy. Promising advances have been achieved under the assumption that participants share the same model structure. However, when participants independently customize their models, models suffer communication barriers, which leads the model heterogeneity problem. Moreover, in real scenarios, the data held by participants is often limited, making the local models trained only on private data present poor performance. Consequently, this paper studies a new challenging problem, namely few-shot model agnostic federated learning, where the local participants design their independent models from their limited private datasets. Considering the scarcity of the private data, we propose to utilize the abundant public available datasets for bridging the gap between local private participants. However, its usage also brings in two problems: inconsistent labels and large domain gap between the public and private datasets. To address these issues, this paper presents a novel framework with two main parts: 1) model agnostic federated learning, it performs public-private communication by unifying the model prediction outputs on the shared public datasets; 2) latent embedding adaptation, it addresses the domain gap with an adversarial learning scheme to discriminate the public and private domains. Together with theoretical generalization bound analysis, comprehensive experiments under various settings have verified our advantage over existing methods. It provides a simple but effective baseline for future advancement. The code is available at https://github.com/WenkeHuang/FSMAFL.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"134 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131811587","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recently, micro-expression (ME) analysis has achieved remarkable progress in a wide range of applications, since an ME is an involuntary facial expression that truly reflects a person's psychological state. In ME analysis, spotting MEs is an essential step, yet it is non-trivial to detect them in long videos because of their short duration and low intensity. To alleviate this problem, we propose a novel micro- and macro-expression (MaE) spotting framework based on an Apex and Boundary Perception Network (ABPN), which mainly consists of three parts: a video encoding module (VEM), a probability evaluation module (PEM), and an expression proposal generation module (EPGM). First, in the VEM we adopt the Main Directional Mean Optical Flow (MDMO) algorithm and calculate optical flow differences to extract facial motion features, which alleviates the impact of head movement and irrelevant facial regions on ME spotting. Then, we extract temporal features with one-dimensional convolutional layers and introduce the PEM to infer the auxiliary probability that each frame is an apex or boundary frame. With these frame-level auxiliary probabilities, the EPGM combines frames from different categories to generate expression proposals for accurate localization. We conduct comprehensive experiments on the MEGC2022 spotting task and demonstrate that our method achieves significant improvements over state-of-the-art baselines on the CAS(ME)² and SAMM-LV datasets. The implemented code is publicly available at https://github.com/wenhaocold/USTC_ME_Spotting.
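The following rough sketch (hypothetical names, not the released USTC_ME_Spotting code) illustrates the PEM/EPGM idea: a 1-D convolutional head predicts per-frame apex and boundary probabilities from optical-flow features, and proposals pair boundary frames that enclose a confident apex.

```python
import torch
import torch.nn as nn

class FramePEM(nn.Module):
    """Probability evaluation over a (batch, feat_dim, num_frames) feature sequence."""
    def __init__(self, feat_dim=64, hidden=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.apex_head = nn.Conv1d(hidden, 1, kernel_size=1)      # P(frame is an apex)
        self.boundary_head = nn.Conv1d(hidden, 1, kernel_size=1)  # P(frame is onset/offset)

    def forward(self, flow_feats):
        h = self.backbone(flow_feats)
        return torch.sigmoid(self.apex_head(h)), torch.sigmoid(self.boundary_head(h))

def generate_proposals(apex_p, boundary_p, thr=0.5):
    """apex_p, boundary_p: 1-D per-video score vectors (batch/channel dims squeezed).
    Greedily pairs the nearest boundary frames around each confident apex."""
    apex_idx = (apex_p > thr).nonzero(as_tuple=True)[0]
    bound_idx = (boundary_p > thr).nonzero(as_tuple=True)[0]
    proposals = []
    for a in apex_idx:
        left = bound_idx[bound_idx < a]
        right = bound_idx[bound_idx > a]
        if len(left) and len(right):
            proposals.append((int(left[-1]), int(a), int(right[0])))  # (onset, apex, offset)
    return proposals
```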
{"title":"ABPN: Apex and Boundary Perception Network for Micro- and Macro-Expression Spotting","authors":"Wenhao Leng, Sirui Zhao, Yiming Zhang, Shiifeng Liu, Xinglong Mao, Hongya Wang, Tong Xu, Enhong Chen","doi":"10.1145/3503161.3551599","DOIUrl":"https://doi.org/10.1145/3503161.3551599","url":null,"abstract":"Recently, Micro expression~(ME) has achieved remarkable progress in a wide range of applications, since it's an involuntary facial expression that reflects personal psychological state truly. In the procedure of ME analysis, spotting ME is an essential step, and is non trivial to be detected from a long interval video because of the short duration and low intensity issues. To alleviate this problem, in this paper, we propose a novel Micro- and Macro-Expression~(MaE) Spotting framework based on Apex and Boundary Perception Network~(ABPN), which mainly consists of three parts, i.e., video encoding module ~(VEM), probability evaluation module~(PEM), and expression proposal generation module~(EPGM). Firstly, we adopt Main Directional Mean Optical Flow (MDMO) algorithm and calculate optical flow differences to extract facial motion features in VEM, which can alleviate the impact of head movement and other areas of the face on ME spotting. Then, we extract temporal features with one-dimension convolutional layers and introduce PEM to infer the auxiliary probability that each frame belongs to an apex or boundary frame. With these frame-level auxiliary probabilities, the EPGM further combines the frames from different categories to generate expression proposals for the accurate localization. Besides, we conduct comprehensive experiments on MEGC2022 spotting task, and demonstrate that our proposed method achieves significant improvement with the comparison of state-of-the-art baselines on rm CAS(ME)2 and SAMM-LV datasets. The implemented code is also publicly available at https://github.com/wenhaocold/USTC_ME_Spotting.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131261256","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In recent years, region features extracted from object detection networks have been widely used in the image-text retrieval task. However, they lack rich background and contextual information, which makes it difficult to match words describing global concepts in sentences, and they also lose details of objects in the image. Fortunately, these disadvantages of region features are exactly the advantages of grid features. In this paper, we propose a novel framework that fuses region features and grid features through a two-step interaction strategy, extracting a more comprehensive image representation for image-text retrieval. Concretely, in the first step, a joint graph with spatial information constraints is constructed, where all region features and grid features are represented as graph nodes. By modeling their relationships with the joint graph, information can be passed edge-wise. In the second step, we propose a Cross-attention Gated Fusion module, which further explores the complex interactions between region features and grid features and then adaptively fuses the two types of features. With these two steps, our model fully realizes the complementary advantages of region and grid features. In addition, we propose a Multi-Attention Pooling module to better aggregate the fused region and grid features. Extensive experiments on two public datasets, Flickr30K and MS-COCO, demonstrate that our model achieves state-of-the-art performance and pushes image-text retrieval to a new height.
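A simplified sketch of the cross-attention gated fusion idea (hypothetical implementation, not the paper's code): region features attend over grid features, and a learned gate decides how much of the attended grid context to mix into each region.

```python
import torch
import torch.nn as nn

class CrossAttentionGatedFusion(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, region_feats, grid_feats):
        # region_feats: (B, Nr, D) detector regions; grid_feats: (B, Ng, D) CNN grid cells
        attended, _ = self.attn(query=region_feats, key=grid_feats, value=grid_feats)
        g = self.gate(torch.cat([region_feats, attended], dim=-1))
        return g * attended + (1.0 - g) * region_feats   # adaptively fused region features
```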
{"title":"Improving Fusion of Region Features and Grid Features via Two-Step Interaction for Image-Text Retrieval","authors":"Dongqing Wu, Huihui Li, Cang Gu, Lei Guo, Hang Liu","doi":"10.1145/3503161.3548223","DOIUrl":"https://doi.org/10.1145/3503161.3548223","url":null,"abstract":"In recent years, region features extracted from object detection networks have been widely used in the image-text retrieval task. However, they lack rich background and contextual information, which makes it difficult to match words describing global concepts in sentences. Meanwhile, the region features also lose the details of objects in the image. Fortunately, these disadvantages of region features are the advantages of grid features. In this paper, we propose a novel framework, which fuses the region features and grid features through a two-step interaction strategy, thus extracting a more comprehensive image representation for image-text retrieval. Concretely, in the first step, a joint graph with spatial information constraints is constructed, where all region features and grid features are represented as graph nodes. By modeling the relationships using the joint graph, the information can be passed edge-wise. In the second step, we propose a Cross-attention Gated Fusion module, which further explores the complex interactions between region features and grid features, and then adaptively fuses different types of features. With these two steps, our model can fully realize the complementary advantages of region features and grid features. In addition, we propose a Multi-Attention Pooling module to better aggregate the fused region features and grid features. Extensive experiments on two public datasets, including Flickr30K and MS-COCO, demonstrate that our model achieves the state-of-the-art and pushes the performance of image-text retrieval to a new height.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133489487","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Existing interactive segmentation methods mainly focus on optimizing user interaction strategies and making better use of the clicks provided by users. However, the goal of an interactive segmentation model is to obtain high-quality masks with limited user interaction on unlabeled new images, and most existing methods overlook the generalization ability of their models when encountering new target scenes. To overcome this problem, we propose a life-long evolution framework for interactive models, which offers a way to handle dynamic target scenes with a single model. Given several target scenes and an initial model trained with labels on a limited closed dataset, our framework arranges sequential evolution steps on each target set. Specifically, we propose an interactive-prototype module to generate and refine pseudo masks, and apply a feature alignment module to adapt the model to a new target scene while preserving performance on previous images. None of the evolution steps requires ground-truth labels as supervision. We conduct thorough experiments on the PASCAL VOC, Cityscapes, and COCO datasets, demonstrating that our framework adapts effectively to new target datasets while maintaining performance on previous scenes.
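A hedged sketch of prototype-based pseudo-mask generation in the spirit of the interactive-prototype module (names and details are hypothetical): prototypes are pooled from features at user-click locations, and every pixel is assigned to its most similar prototype.

```python
import torch
import torch.nn.functional as F

def pseudo_mask_from_clicks(feat_map, clicks):
    """
    feat_map: (C, H, W) features from the current model.
    clicks:   dict {class_id: list of (y, x) click coordinates}.
    Returns an (H, W) tensor of pseudo labels.
    """
    feats = F.normalize(feat_map, dim=0)                           # unit vectors per pixel
    protos, class_ids = [], []
    for cid, points in clicks.items():
        vecs = torch.stack([feats[:, y, x] for y, x in points])    # (k, C) clicked features
        protos.append(F.normalize(vecs.mean(0), dim=0))            # class prototype
        class_ids.append(cid)
    protos = torch.stack(protos)                                   # (K, C)
    sim = torch.einsum("kc,chw->khw", protos, feats)               # cosine similarity maps
    assignment = sim.argmax(dim=0)                                 # (H, W) prototype index
    return torch.tensor(class_ids)[assignment]
```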
{"title":"Interact with Open Scenes: A Life-long Evolution Framework for Interactive Segmentation Models","authors":"Ruitong Gan, Junsong Fan, Yuxi Wang, Zhaoxiang Zhang","doi":"10.1145/3503161.3548131","DOIUrl":"https://doi.org/10.1145/3503161.3548131","url":null,"abstract":"Existing interactive segmentation methods mainly focus on optimizing user interacting strategies, as well as making better use of clicks provided by users. However, the intention of the interactive segmentation model is to obtain high-quality masks with limited user interactions, which are supposed to be applied to unlabeled new images. But most existing methods overlooked the generalization ability of their models when witnessing new target scenes. To overcome this problem, we propose a life-long evolution framework for interactive models in this paper, which provides a possible solution for dealing with dynamic target scenes with one single model. Given several target scenes and an initial model trained with labels on the limited closed dataset, our framework arranges sequentially evolution steps on each target set. Specifically, we propose an interactive-prototype module to generate and refine pseudo masks, and apply a feature alignment module in order to adapt the model to a new target scene and keep the performance on previous images at the same time. All evolution steps above do not require ground truth labels as supervision. We conduct thorough experiments on PASCAL VOC, Cityscapes, and COCO datasets, demonstrating the effectiveness of our framework in solving new target datasets and maintaining performance on previous scenes at the same time.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132766544","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
2D-3D unsupervised domain adaptation (UDA) tackles the lack of annotations in a new domain by capitalizing on the relationship between 2D and 3D data. Existing methods achieve considerable improvements by performing cross-modality alignment in a modality-agnostic way, failing to exploit modality-specific characteristics for modeling complementarity. In this paper, we present self-supervised exclusive learning for cross-modal semantic segmentation under the UDA scenario, which avoids prohibitive annotation. Specifically, two self-supervised tasks are designed, named "plane-to-spatial" and "discrete-to-textured". The former helps the 2D network branch improve its perception of spatial metrics, and the latter supplements structured texture information for the 3D network branch. In this way, modality-specific exclusive information can be effectively learned and the complementarity of the two modalities is strengthened, resulting in a network that is robust across domains. Under the supervision of these self-supervised tasks, we introduce a mixed domain to enhance the perception of the target domain by mixing patches of source- and target-domain samples. Besides, we propose a domain-category adversarial learning scheme with category-wise discriminators, constructing category prototypes to learn domain-invariant features. We evaluate our method on various multi-modality domain adaptation settings, where our results significantly outperform both uni-modality and multi-modality state-of-the-art competitors.
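A minimal sketch of the "mixed domain" construction (a hypothetical patch-mixing variant, not necessarily the paper's exact recipe): rectangular patches of a target-domain image are pasted into a source-domain image so the network sees both domains within one sample.

```python
import torch

def mix_domains(src_img, tgt_img, num_patches=4, patch_frac=0.25):
    """src_img, tgt_img: (C, H, W) tensors of the same size."""
    mixed = src_img.clone()
    _, H, W = src_img.shape
    ph, pw = int(H * patch_frac), int(W * patch_frac)
    for _ in range(num_patches):
        y = torch.randint(0, H - ph + 1, (1,)).item()
        x = torch.randint(0, W - pw + 1, (1,)).item()
        # paste a target-domain patch into the source-domain image
        mixed[:, y:y + ph, x:x + pw] = tgt_img[:, y:y + ph, x:x + pw]
    return mixed
```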
{"title":"Self-supervised Exclusive Learning for 3D Segmentation with Cross-Modal Unsupervised Domain Adaptation","authors":"Yachao Zhang, Miaoyu Li, Yuan Xie, Cuihua Li, Cong Wang, Zhizhong Zhang, Yanyun Qu","doi":"10.1145/3503161.3547987","DOIUrl":"https://doi.org/10.1145/3503161.3547987","url":null,"abstract":"2D-3D unsupervised domain adaptation (UDA) tackles the lack of annotations in a new domain by capitalizing the relationship between 2D and 3D data. Existing methods achieve considerable improvements by performing cross-modality alignment in a modality-agnostic way, failing to exploit modality-specific characteristic for modeling complementarity. In this paper, we present self-supervised exclusive learning for cross-modal semantic segmentation under the UDA scenario, which avoids the prohibitive annotation. Specifically, two self-supervised tasks are designed, named \"plane-to-spatial'' and \"discrete-to-textured''. The former helps the 2D network branch improve the perception of spatial metrics, and the latter supplements structured texture information for the 3D network branch. In this way, modality-specific exclusive information can be effectively learned, and the complementarity of multi-modality is strengthened, resulting in a robust network to different domains. With the help of the self-supervised tasks supervision, we introduce a mixed domain to enhance the perception of the target domain by mixing the patches of the source and target domain samples. Besides, we propose a domain-category adversarial learning with category-wise discriminators by constructing the category prototypes for learning domain-invariant features. We evaluate our method on various multi-modality domain adaptation settings, where our results significantly outperform both uni-modality and multi-modality state-of-the-art competitors.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131277743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Single-image 3D human reconstruction aims to reconstruct the 3D textured surface of the human body from a single image. While implicit function-based methods have recently achieved reasonable reconstruction performance, they still suffer from degraded surface geometry and texture quality when viewed from unobserved directions. In response, to generate a realistic textured surface, we propose ReFu, a coarse-to-fine approach that refines the projected backside-view image and fuses the refined image to predict the final human body. To suppress the diffused occupancy that causes noise in projection images and reconstructed meshes, we propose to train occupancy probabilities with occupancy-based volume rendering, utilizing 2D and 3D supervision simultaneously. We also introduce a refinement architecture that generates detail-preserving backside-view images with front-to-back warping. Extensive experiments demonstrate that our method achieves state-of-the-art performance in 3D human reconstruction from a single image, with enhanced geometry and texture quality for unobserved views.
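A rough sketch of occupancy-based volume rendering (an assumed, standard alpha-compositing formulation; the authors' exact losses and sampling are not reproduced): per-sample occupancy probabilities along a ray are composited into a pixel color and a silhouette value, which can then receive 2D supervision.

```python
import torch

def render_ray(occupancy, colors):
    """
    occupancy: (N,) probabilities in [0, 1] for N samples ordered front-to-back.
    colors:    (N, 3) predicted colors at those samples.
    """
    transmittance = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - occupancy[:-1]]), dim=0)   # prob. the ray survived so far
    weights = transmittance * occupancy                            # contribution of each sample
    rgb = (weights.unsqueeze(-1) * colors).sum(dim=0)              # rendered pixel color
    alpha = weights.sum()                                          # rendered silhouette / mask value
    return rgb, alpha
```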
{"title":"ReFu: Refine and Fuse the Unobserved View for Detail-Preserving Single-Image 3D Human Reconstruction","authors":"Gyumin Shim, M. Lee, J. Choo","doi":"10.1145/3503161.3547971","DOIUrl":"https://doi.org/10.1145/3503161.3547971","url":null,"abstract":"Single-image 3D human reconstruction aims to reconstruct the 3D textured surface of the human body given a single image. While implicit function-based methods recently achieved reasonable reconstruction performance, they still bear limitations showing degraded quality in both surface geometry and texture from an unobserved view. In response, to generate a realistic textured surface, we propose ReFu, a coarse-to-fine approach that refines the projected backside view image and fuses the refined image to predict the final human body. To suppress the diffused occupancy that causes noise in projection images and reconstructed meshes, we propose to train occupancy probability by simultaneously utilizing 2D and 3D supervisions with occupancy-based volume rendering. We also introduce a refinement architecture that generates detail-preserving backside-view images with front-to-back warping. Extensive experiments demonstrate that our method achieves state-of-the-art performance in 3D human reconstruction from a single image, showing enhanced geometry and texture quality from an unobserved view.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133767509","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pan-sharpening aims to generate a high-spatial-resolution multi-spectral (MS) image by fusing a high-spatial-resolution panchromatic (PAN) image with its corresponding low-spatial-resolution MS image. Despite remarkable progress, most existing pan-sharpening methods operate only in the spatial domain and rarely explore potential solutions in the frequency domain. In this paper, we propose a novel pan-sharpening framework that adaptively learns low-high frequency information integration in the spatial and frequency dual domains. It consists of three key designs: a mask prediction sub-network, a low-frequency learning sub-network, and a high-frequency learning sub-network. Specifically, the first measures the modality-aware frequency information difference between the PAN and MS images and predicts the low-high frequency boundary in the form of a two-dimensional mask. Given the mask, the second adaptively picks out the corresponding low-frequency components of the different modalities and restores the expected low-frequency content by integrating information from the spatial and frequency domains, while the third combines the refined low-frequency content with the original high-frequency content for latent high-frequency reconstruction. In this way, low-high frequency information is adaptively learned, leading to pleasing results. Extensive experiments validate the effectiveness of the proposed network and demonstrate favorable performance against other state-of-the-art methods. The source code will be released at https://github.com/manman1995/pansharpening.
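A hedged sketch of frequency decomposition with a predicted two-dimensional mask (hypothetical; the paper's sub-networks are not reproduced here): the mask selects low-frequency components in the Fourier domain and the remainder is treated as the high-frequency residual.

```python
import torch

def split_frequencies(img, mask):
    """
    img:  (B, C, H, W) real-valued image tensor.
    mask: (B, 1, H, W) values in [0, 1] over the (unshifted) frequency grid,
          close to 1 where frequencies count as "low".
    """
    spec = torch.fft.fft2(img, norm="ortho")
    low = torch.fft.ifft2(spec * mask, norm="ortho").real   # masked low-frequency content
    high = img - low                                         # complementary high-frequency residual
    return low, high
```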
{"title":"Adaptively Learning Low-high Frequency Information Integration for Pan-sharpening","authors":"Man Zhou, Jie Huang, Chongyi Li, Huan Yu, Keyu Yan, Naishan Zheng, Fengmei Zhao","doi":"10.1145/3503161.3547924","DOIUrl":"https://doi.org/10.1145/3503161.3547924","url":null,"abstract":"Pan-sharpening aims to generate high-spatial resolution multi-spectral (MS) image by fusing high-spatial resolution panchromatic (PAN) image and its corresponding low-spatial resolution MS image. Despite the remarkable progress, most existing pan-sharpening methods only work in the spatial domain and rarely explore the potential solutions in the frequency domain. In this paper, we propose a novel pan-sharpening framework by adaptively learning low-high frequency information integration in the spatial and frequency dual domains. It consists of three key designs: mask prediction sub-network, low-frequency learning sub-network and high-frequency learning sub-network. Specifically, the first is responsible for measuring the modality-aware frequency information difference of PAN and MS images and further predicting the low-high frequency boundary in the form of a two-dimensional mask. In view of the mask, the second adaptively picks out the corresponding low-frequency components of different modalities and then restores the expected low-frequency one by spatial and frequency dual domains information integration while the third combines the above refined low-frequency and the original high-frequency for the latent high-frequency reconstruction. In this way, the low-high frequency information is adaptively learned, thus leading to the pleasing results. Extensive experiments validate the effectiveness of the proposed network and demonstrate the favorable performance against other state-of-the-art methods. The source code will be released at https://github.com/manman1995/pansharpening.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"164 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115407652","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-label image classification, whose methods can be categorized into label-dependency and region-based approaches, is a challenging problem due to complex underlying object layouts. Although region-based methods are less prone to generalization issues than label-dependency methods, they often generate hundreds of meaningless or noisy proposals with non-discriminative information, and the contextual dependency among the localized regions is often ignored or over-simplified. This paper builds a unified framework that performs effective noisy-proposal suppression and enables interaction between global and local features for robust feature learning. Specifically, we propose category-aware weak supervision that concentrates on non-existent categories so as to provide deterministic information for local feature learning, restricting the local branch to focus on higher-quality regions of interest. Moreover, we develop a cross-granularity attention module to explore the complementary information between global and local features, which builds high-order feature correlations containing not only global-to-local but also local-to-local relations. Both designs boost the performance of the whole network. Extensive experiments on two large-scale datasets (MS-COCO and VOC 2007) demonstrate that our framework achieves superior performance over state-of-the-art methods.
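A minimal sketch of one possible form of category-aware weak supervision (an assumed loss, not the paper's exact formulation): for categories absent from the image-level label, local-region scores are pushed toward zero, discouraging noisy proposals.

```python
import torch

def category_aware_weak_loss(region_logits, image_labels):
    """
    region_logits: (B, R, K) scores of R local regions over K categories.
    image_labels:  (B, K) multi-hot ground truth for the whole image.
    """
    probs = torch.sigmoid(region_logits)
    absent = (1.0 - image_labels).unsqueeze(1)   # (B, 1, K), 1 for non-existent classes
    # penalize any region responding to a category that is not present in the image
    return (probs * absent).mean()
```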
{"title":"Global Meets Local: Effective Multi-Label Image Classification via Category-Aware Weak Supervision","authors":"Jiawei Zhan, J. Liu, Wei Tang, Guannan Jiang, Xi Wang, Bin-Bin Gao, Tianliang Zhang, Wenlong Wu, Wei Zhang, Chengjie Wang, Yuan Xie","doi":"10.1145/3503161.3547834","DOIUrl":"https://doi.org/10.1145/3503161.3547834","url":null,"abstract":"Multi-label image classification, which can be categorized into label-dependency and region-based methods, is a challenging problem due to the complex underlying object layouts. Although region-based methods are less likely to encounter issues with model generalizability than label-dependency methods, they often generate hundreds of meaningless or noisy proposals with non-discriminative information, and the contextual dependency among the localized regions is often ignored or over-simplified. This paper builds a unified framework to perform effective noisy-proposal suppression and to interact between global and local features for robust feature learning. Specifically, we propose category-aware weak supervision to concentrate on non-existent categories so as to provide deterministic information for local feature learning, restricting the local branch to focus on more high-quality regions of interest. Moreover, we develop a cross-granularity attention module to explore the complementary information between global and local features, which can build the high-order feature correlation containing not only global-to-local, but also local-to-local relations. Both advantages guarantee a boost in the performance of the whole network. Extensive experiments on two large-scale datasets (MS-COCO and VOC 2007) demonstrate that our framework achieves superior performance over state-of-the-art methods.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124286627","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Training tuple construction is a crucial step in unsupervised local descriptor learning. Existing approaches perform this step relying on heuristics, which suffer from inaccurate supervision signals and struggle to achieve the desired performance. To address the problem, this work presents DescPro, an unsupervised approach that progressively explores both accurate and informative training tuples for model optimization without using heuristics. Specifically, DescPro consists of a Robust Cluster Assignment (RCA) method to infer pairwise relationships by clustering reliable samples with the increasingly powerful CNN model, and a Similarity-weighted Positive Sampling (SPS) strategy to select informative positive pairs for training tuple construction. Extensive experimental results show that, with the collaboration of the above two modules, DescPro can outperform state-of-the-art unsupervised local descriptors and even rival competitive supervised ones on standard benchmarks.
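A hedged sketch of similarity-weighted positive sampling (the exact weighting in DescPro may differ; names here are hypothetical): within each cluster produced by the assignment step, positive pairs are drawn with probability that favors less-similar, more informative pairs.

```python
import torch
import torch.nn.functional as F

def sample_positive_pairs(descriptors, cluster_ids, num_pairs=256):
    """descriptors: (N, D) features; cluster_ids: (N,) integer cluster assignments."""
    descriptors = F.normalize(descriptors, dim=1)
    pairs, weights = [], []
    for c in cluster_ids.unique():
        idx = (cluster_ids == c).nonzero(as_tuple=True)[0]
        if len(idx) < 2:
            continue
        sim = descriptors[idx] @ descriptors[idx].t()
        i, j = torch.triu_indices(len(idx), len(idx), offset=1)   # all in-cluster pairs
        pairs.append(torch.stack([idx[i], idx[j]], dim=1))
        weights.append(1.0 - sim[i, j])            # harder (less similar) pairs weigh more
    pairs = torch.cat(pairs)
    weights = torch.cat(weights).clamp(min=1e-6)
    pick = torch.multinomial(weights, min(num_pairs, len(weights)), replacement=False)
    return pairs[pick]                              # (num_pairs, 2) index pairs
```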
{"title":"Progressive Unsupervised Learning of Local Descriptors","authors":"Wu‐ru Wang, Lei Zhang, Hua Huang","doi":"10.1145/3503161.3547792","DOIUrl":"https://doi.org/10.1145/3503161.3547792","url":null,"abstract":"Training tuple construction is a crucial step in unsupervised local descriptor learning. Existing approaches perform this step relying on heuristics, which suffer from inaccurate supervision signals and struggle to achieve the desired performance. To address the problem, this work presents DescPro, an unsupervised approach that progressively explores both accurate and informative training tuples for model optimization without using heuristics. Specifically, DescPro consists of a Robust Cluster Assignment (RCA) method to infer pairwise relationships by clustering reliable samples with the increasingly powerful CNN model, and a Similarity-weighted Positive Sampling (SPS) strategy to select informative positive pairs for training tuple construction. Extensive experimental results show that, with the collaboration of the above two modules, DescPro can outperform state-of-the-art unsupervised local descriptors and even rival competitive supervised ones on standard benchmarks.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"340 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124309178","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Existing object detection models can successfully discriminate and localize predefined object categories under seen or similar situations. However, open-world object detection, as required by autonomous driving perception systems, involves recognizing unseen objects under various scenarios. On the one hand, the knowledge gap between seen and unseen object categories poses extreme challenges for models trained with supervision only from the seen categories. On the other hand, domain differences across scenarios make it necessary to take the domain gap into consideration by aligning the sample or label distribution. To resolve these two challenges simultaneously, we first design a pre-training model that formulates the mappings between visual images and semantic embeddings derived from extra annotations, linking the seen and unseen object categories in a self-supervised manner. Within this formulation, domain adaptation is then utilized to extract domain-agnostic feature representations and alleviate the misdetection of unseen objects caused by changes in domain appearance. As a result, the more realistic and practical open-world object detection problem is addressed by our novel formulation, which detects unseen categories from unseen domains without any bounding box annotations while incurring no obvious performance drop on the seen categories. We are the first to formulate a unified model for the open-world task and establish new state-of-the-art performance for this challenge.
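A simplified sketch of classification via vision-semantic alignment, the general mechanism the abstract describes (hypothetical names; not the authors' implementation): region features are projected into a word-embedding space and a detection is labeled by its nearest class embedding, so categories unseen at training time can still be named if their embeddings are available.

```python
import torch
import torch.nn.functional as F

def classify_by_semantics(region_feats, class_embeddings, class_names, tau=0.07):
    """
    region_feats:     (R, D) projected visual features of region proposals.
    class_embeddings: (K, D) semantic embeddings covering seen and unseen classes.
    class_names:      list of K category names aligned with class_embeddings.
    """
    sim = F.normalize(region_feats, dim=1) @ F.normalize(class_embeddings, dim=1).t()
    probs = F.softmax(sim / tau, dim=1)             # temperature-scaled similarity
    scores, idx = probs.max(dim=1)
    return [(class_names[i], float(s)) for i, s in zip(idx.tolist(), scores)]
```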
{"title":"Rethinking Open-World Object Detection in Autonomous Driving Scenarios","authors":"Zeyu Ma, Yang Yang, Guoqing Wang, Xing Xu, Heng Tao Shen, Mingxing Zhang","doi":"10.1145/3503161.3548165","DOIUrl":"https://doi.org/10.1145/3503161.3548165","url":null,"abstract":"Existing object detection models have been demonstrated to successfully discriminate and localize the predefined object categories under the seen or similar situations. However, the open-world object detection as required by autonomous driving perception systems refers to recognizing unseen objects under various scenarios. On the one hand, the knowledge gap between seen and unseen object categories poses extreme challenges for models trained with supervision only from the seen object categories. On the other hand, the domain differences across different scenarios also cause an additional urge to take the domain gap into consideration by aligning the sample or label distribution. Aimed at resolving these two challenges simultaneously, we firstly design a pre-training model to formulate the mappings between visual images and semantic embeddings from the extra annotations as guidance to link the seen and unseen object categories through a self-supervised manner. Within this formulation, the domain adaptation is then utilized for extracting the domain-agnostic feature representations and alleviating the misdetection of unseen objects caused by the domain appearance changes. As a result, the more realistic and practical open-world object detection problem is visited and resolved by our novel formulation, which could detect the unseen categories from unseen domains without any bounding box annotations while there is no obvious performance drop in detecting the seen categories. We are the first to formulate a unified model for open-world task and establish a new state-of-the-art performance for this challenge.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114350839","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}