Federated learning has received increasing attention for its ability to enable collaborative learning without leaking privacy. Promising advances have been achieved under the assumption that participants share the same model structure. However, when participants independently customize their models, the resulting architectures cannot communicate directly, which leads to the model heterogeneity problem. Moreover, in real scenarios, the data held by each participant is often limited, so local models trained only on private data perform poorly. Consequently, this paper studies a new and challenging problem, few-shot model agnostic federated learning, where local participants design independent models from their limited private datasets. Considering the scarcity of private data, we propose to leverage abundant publicly available datasets to bridge the gap between local private participants. However, their usage also introduces two problems: inconsistent labels and a large domain gap between the public and private datasets. To address these issues, this paper presents a novel framework with two main parts: 1) model agnostic federated learning, which performs public-private communication by unifying the model prediction outputs on the shared public datasets; and 2) latent embedding adaptation, which addresses the domain gap with an adversarial learning scheme that discriminates between the public and private domains. Together with a theoretical generalization bound analysis, comprehensive experiments under various settings verify our advantage over existing methods and provide a simple yet effective baseline for future work. The code is available at https://github.com/WenkeHuang/FSMAFL.
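As a concrete illustration of the public-private communication idea described above, the following minimal sketch (all names are hypothetical and this is not the released FSMAFL code) aligns heterogeneous client models by distilling each one toward the averaged consensus prediction on an unlabeled public dataset:

```python
# Sketch only: assumes each client model outputs logits over the same public
# label space, even though their architectures differ.
import torch
import torch.nn.functional as F

def communication_round(clients, public_loader, optimizers, temperature=1.0):
    """clients: list of nn.Module instances with possibly different architectures."""
    for x_pub, _ in public_loader:                      # public labels are not needed
        with torch.no_grad():
            # consensus soft prediction averaged over all participants
            consensus = torch.stack(
                [F.softmax(c(x_pub) / temperature, dim=1) for c in clients]
            ).mean(dim=0)
        for client, opt in zip(clients, optimizers):
            logits = client(x_pub)
            loss = F.kl_div(F.log_softmax(logits / temperature, dim=1),
                            consensus, reduction="batchmean")
            opt.zero_grad()
            loss.backward()
            opt.step()
```

The adversarial latent embedding adaptation would add a domain discriminator on top of this loop; it is omitted here for brevity.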
{"title":"Few-Shot Model Agnostic Federated Learning","authors":"Wenke Huang, Mang Ye, Bo Du, Xiand Gao","doi":"10.1145/3503161.3548764","DOIUrl":"https://doi.org/10.1145/3503161.3548764","url":null,"abstract":"Federated learning has received increasing attention for its ability to collaborative learning without leaking privacy. Promising advances have been achieved under the assumption that participants share the same model structure. However, when participants independently customize their models, models suffer communication barriers, which leads the model heterogeneity problem. Moreover, in real scenarios, the data held by participants is often limited, making the local models trained only on private data present poor performance. Consequently, this paper studies a new challenging problem, namely few-shot model agnostic federated learning, where the local participants design their independent models from their limited private datasets. Considering the scarcity of the private data, we propose to utilize the abundant public available datasets for bridging the gap between local private participants. However, its usage also brings in two problems: inconsistent labels and large domain gap between the public and private datasets. To address these issues, this paper presents a novel framework with two main parts: 1) model agnostic federated learning, it performs public-private communication by unifying the model prediction outputs on the shared public datasets; 2) latent embedding adaptation, it addresses the domain gap with an adversarial learning scheme to discriminate the public and private domains. Together with theoretical generalization bound analysis, comprehensive experiments under various settings have verified our advantage over existing methods. It provides a simple but effective baseline for future advancement. The code is available at https://github.com/WenkeHuang/FSMAFL.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"134 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131811587","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recently, micro-expression (ME) analysis has achieved remarkable progress in a wide range of applications, since an ME is an involuntary facial expression that truly reflects a person's psychological state. In ME analysis, spotting MEs is an essential step, yet it is non-trivial to detect them in long videos because of their short duration and low intensity. To alleviate this problem, we propose a novel micro- and macro-expression (MaE) spotting framework based on an Apex and Boundary Perception Network (ABPN), which mainly consists of three parts: a video encoding module (VEM), a probability evaluation module (PEM), and an expression proposal generation module (EPGM). First, in the VEM we adopt the Main Directional Mean Optical Flow (MDMO) algorithm and calculate optical flow differences to extract facial motion features, which alleviates the impact of head movement and irrelevant facial regions on ME spotting. Then, we extract temporal features with one-dimensional convolutional layers and introduce the PEM to infer the auxiliary probability that each frame is an apex or boundary frame. With these frame-level auxiliary probabilities, the EPGM combines frames from different categories to generate expression proposals for accurate localization. We conduct comprehensive experiments on the MEGC2022 spotting task and demonstrate that our method achieves significant improvements over state-of-the-art baselines on the CAS(ME)² and SAMM-LV datasets. The implemented code is publicly available at https://github.com/wenhaocold/USTC_ME_Spotting.
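The following rough sketch (hypothetical names, not the released USTC_ME_Spotting code) illustrates the PEM/EPGM idea: a 1-D convolutional head predicts per-frame apex and boundary probabilities from optical-flow features, and proposals pair boundary frames that enclose a confident apex.

```python
import torch
import torch.nn as nn

class FramePEM(nn.Module):
    """Probability evaluation over a (batch, feat_dim, num_frames) feature sequence."""
    def __init__(self, feat_dim=64, hidden=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.apex_head = nn.Conv1d(hidden, 1, kernel_size=1)      # P(frame is an apex)
        self.boundary_head = nn.Conv1d(hidden, 1, kernel_size=1)  # P(frame is onset/offset)

    def forward(self, flow_feats):
        h = self.backbone(flow_feats)
        return torch.sigmoid(self.apex_head(h)), torch.sigmoid(self.boundary_head(h))

def generate_proposals(apex_p, boundary_p, thr=0.5):
    """apex_p, boundary_p: 1-D per-video score vectors (batch/channel dims squeezed).
    Greedily pairs the nearest boundary frames around each confident apex."""
    apex_idx = (apex_p > thr).nonzero(as_tuple=True)[0]
    bound_idx = (boundary_p > thr).nonzero(as_tuple=True)[0]
    proposals = []
    for a in apex_idx:
        left = bound_idx[bound_idx < a]
        right = bound_idx[bound_idx > a]
        if len(left) and len(right):
            proposals.append((int(left[-1]), int(a), int(right[0])))  # (onset, apex, offset)
    return proposals
```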
{"title":"ABPN: Apex and Boundary Perception Network for Micro- and Macro-Expression Spotting","authors":"Wenhao Leng, Sirui Zhao, Yiming Zhang, Shiifeng Liu, Xinglong Mao, Hongya Wang, Tong Xu, Enhong Chen","doi":"10.1145/3503161.3551599","DOIUrl":"https://doi.org/10.1145/3503161.3551599","url":null,"abstract":"Recently, Micro expression~(ME) has achieved remarkable progress in a wide range of applications, since it's an involuntary facial expression that reflects personal psychological state truly. In the procedure of ME analysis, spotting ME is an essential step, and is non trivial to be detected from a long interval video because of the short duration and low intensity issues. To alleviate this problem, in this paper, we propose a novel Micro- and Macro-Expression~(MaE) Spotting framework based on Apex and Boundary Perception Network~(ABPN), which mainly consists of three parts, i.e., video encoding module ~(VEM), probability evaluation module~(PEM), and expression proposal generation module~(EPGM). Firstly, we adopt Main Directional Mean Optical Flow (MDMO) algorithm and calculate optical flow differences to extract facial motion features in VEM, which can alleviate the impact of head movement and other areas of the face on ME spotting. Then, we extract temporal features with one-dimension convolutional layers and introduce PEM to infer the auxiliary probability that each frame belongs to an apex or boundary frame. With these frame-level auxiliary probabilities, the EPGM further combines the frames from different categories to generate expression proposals for the accurate localization. Besides, we conduct comprehensive experiments on MEGC2022 spotting task, and demonstrate that our proposed method achieves significant improvement with the comparison of state-of-the-art baselines on rm CAS(ME)2 and SAMM-LV datasets. The implemented code is also publicly available at https://github.com/wenhaocold/USTC_ME_Spotting.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131261256","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In recent years, region features extracted from object detection networks have been widely used in the image-text retrieval task. However, they lack rich background and contextual information, which makes it difficult to match words describing global concepts in sentences, and they also lose details of objects in the image. Fortunately, these disadvantages of region features are exactly the advantages of grid features. In this paper, we propose a novel framework that fuses region features and grid features through a two-step interaction strategy, extracting a more comprehensive image representation for image-text retrieval. Concretely, in the first step, a joint graph with spatial information constraints is constructed, where all region features and grid features are represented as graph nodes. By modeling their relationships with the joint graph, information can be passed edge-wise. In the second step, we propose a Cross-attention Gated Fusion module, which further explores the complex interactions between region features and grid features and then adaptively fuses the two types of features. With these two steps, our model fully realizes the complementary advantages of region and grid features. In addition, we propose a Multi-Attention Pooling module to better aggregate the fused region and grid features. Extensive experiments on two public datasets, Flickr30K and MS-COCO, demonstrate that our model achieves state-of-the-art performance and pushes image-text retrieval to a new height.
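A simplified sketch of the cross-attention gated fusion idea (hypothetical implementation, not the paper's code): region features attend over grid features, and a learned gate decides how much of the attended grid context to mix into each region.

```python
import torch
import torch.nn as nn

class CrossAttentionGatedFusion(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, region_feats, grid_feats):
        # region_feats: (B, Nr, D) detector regions; grid_feats: (B, Ng, D) CNN grid cells
        attended, _ = self.attn(query=region_feats, key=grid_feats, value=grid_feats)
        g = self.gate(torch.cat([region_feats, attended], dim=-1))
        return g * attended + (1.0 - g) * region_feats   # adaptively fused region features
```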
{"title":"Improving Fusion of Region Features and Grid Features via Two-Step Interaction for Image-Text Retrieval","authors":"Dongqing Wu, Huihui Li, Cang Gu, Lei Guo, Hang Liu","doi":"10.1145/3503161.3548223","DOIUrl":"https://doi.org/10.1145/3503161.3548223","url":null,"abstract":"In recent years, region features extracted from object detection networks have been widely used in the image-text retrieval task. However, they lack rich background and contextual information, which makes it difficult to match words describing global concepts in sentences. Meanwhile, the region features also lose the details of objects in the image. Fortunately, these disadvantages of region features are the advantages of grid features. In this paper, we propose a novel framework, which fuses the region features and grid features through a two-step interaction strategy, thus extracting a more comprehensive image representation for image-text retrieval. Concretely, in the first step, a joint graph with spatial information constraints is constructed, where all region features and grid features are represented as graph nodes. By modeling the relationships using the joint graph, the information can be passed edge-wise. In the second step, we propose a Cross-attention Gated Fusion module, which further explores the complex interactions between region features and grid features, and then adaptively fuses different types of features. With these two steps, our model can fully realize the complementary advantages of region features and grid features. In addition, we propose a Multi-Attention Pooling module to better aggregate the fused region features and grid features. Extensive experiments on two public datasets, including Flickr30K and MS-COCO, demonstrate that our model achieves the state-of-the-art and pushes the performance of image-text retrieval to a new height.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133489487","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Existing interactive segmentation methods mainly focus on optimizing user interaction strategies and making better use of the clicks provided by users. However, the goal of an interactive segmentation model is to obtain high-quality masks with limited user interaction on unlabeled new images, and most existing methods overlook the generalization ability of their models when encountering new target scenes. To overcome this problem, we propose a life-long evolution framework for interactive models, which offers a way to handle dynamic target scenes with a single model. Given several target scenes and an initial model trained with labels on a limited closed dataset, our framework arranges sequential evolution steps on each target set. Specifically, we propose an interactive-prototype module to generate and refine pseudo masks, and apply a feature alignment module to adapt the model to a new target scene while preserving performance on previous images. None of the evolution steps requires ground-truth labels as supervision. We conduct thorough experiments on the PASCAL VOC, Cityscapes, and COCO datasets, demonstrating that our framework adapts effectively to new target datasets while maintaining performance on previous scenes.
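A hedged sketch of prototype-based pseudo-mask generation in the spirit of the interactive-prototype module (names and details are hypothetical): prototypes are pooled from features at user-click locations, and every pixel is assigned to its most similar prototype.

```python
import torch
import torch.nn.functional as F

def pseudo_mask_from_clicks(feat_map, clicks):
    """
    feat_map: (C, H, W) features from the current model.
    clicks:   dict {class_id: list of (y, x) click coordinates}.
    Returns an (H, W) tensor of pseudo labels.
    """
    feats = F.normalize(feat_map, dim=0)                           # unit vectors per pixel
    protos, class_ids = [], []
    for cid, points in clicks.items():
        vecs = torch.stack([feats[:, y, x] for y, x in points])    # (k, C) clicked features
        protos.append(F.normalize(vecs.mean(0), dim=0))            # class prototype
        class_ids.append(cid)
    protos = torch.stack(protos)                                   # (K, C)
    sim = torch.einsum("kc,chw->khw", protos, feats)               # cosine similarity maps
    assignment = sim.argmax(dim=0)                                 # (H, W) prototype index
    return torch.tensor(class_ids)[assignment]
```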
{"title":"Interact with Open Scenes: A Life-long Evolution Framework for Interactive Segmentation Models","authors":"Ruitong Gan, Junsong Fan, Yuxi Wang, Zhaoxiang Zhang","doi":"10.1145/3503161.3548131","DOIUrl":"https://doi.org/10.1145/3503161.3548131","url":null,"abstract":"Existing interactive segmentation methods mainly focus on optimizing user interacting strategies, as well as making better use of clicks provided by users. However, the intention of the interactive segmentation model is to obtain high-quality masks with limited user interactions, which are supposed to be applied to unlabeled new images. But most existing methods overlooked the generalization ability of their models when witnessing new target scenes. To overcome this problem, we propose a life-long evolution framework for interactive models in this paper, which provides a possible solution for dealing with dynamic target scenes with one single model. Given several target scenes and an initial model trained with labels on the limited closed dataset, our framework arranges sequentially evolution steps on each target set. Specifically, we propose an interactive-prototype module to generate and refine pseudo masks, and apply a feature alignment module in order to adapt the model to a new target scene and keep the performance on previous images at the same time. All evolution steps above do not require ground truth labels as supervision. We conduct thorough experiments on PASCAL VOC, Cityscapes, and COCO datasets, demonstrating the effectiveness of our framework in solving new target datasets and maintaining performance on previous scenes at the same time.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132766544","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
2D-3D unsupervised domain adaptation (UDA) tackles the lack of annotations in a new domain by capitalizing on the relationship between 2D and 3D data. Existing methods achieve considerable improvements by performing cross-modality alignment in a modality-agnostic way, failing to exploit modality-specific characteristics for modeling complementarity. In this paper, we present self-supervised exclusive learning for cross-modal semantic segmentation under the UDA scenario, which avoids prohibitive annotation. Specifically, two self-supervised tasks are designed, named "plane-to-spatial" and "discrete-to-textured". The former helps the 2D network branch improve its perception of spatial metrics, and the latter supplements structured texture information for the 3D network branch. In this way, modality-specific exclusive information can be effectively learned and the complementarity of the two modalities is strengthened, resulting in a network that is robust across domains. Under the supervision of these self-supervised tasks, we introduce a mixed domain to enhance the perception of the target domain by mixing patches of source- and target-domain samples. Besides, we propose a domain-category adversarial learning scheme with category-wise discriminators, constructing category prototypes to learn domain-invariant features. We evaluate our method on various multi-modality domain adaptation settings, where our results significantly outperform both uni-modality and multi-modality state-of-the-art competitors.
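A minimal sketch of the "mixed domain" construction (a hypothetical patch-mixing variant, not necessarily the paper's exact recipe): rectangular patches of a target-domain image are pasted into a source-domain image so the network sees both domains within one sample.

```python
import torch

def mix_domains(src_img, tgt_img, num_patches=4, patch_frac=0.25):
    """src_img, tgt_img: (C, H, W) tensors of the same size."""
    mixed = src_img.clone()
    _, H, W = src_img.shape
    ph, pw = int(H * patch_frac), int(W * patch_frac)
    for _ in range(num_patches):
        y = torch.randint(0, H - ph + 1, (1,)).item()
        x = torch.randint(0, W - pw + 1, (1,)).item()
        # paste a target-domain patch into the source-domain image
        mixed[:, y:y + ph, x:x + pw] = tgt_img[:, y:y + ph, x:x + pw]
    return mixed
```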
{"title":"Self-supervised Exclusive Learning for 3D Segmentation with Cross-Modal Unsupervised Domain Adaptation","authors":"Yachao Zhang, Miaoyu Li, Yuan Xie, Cuihua Li, Cong Wang, Zhizhong Zhang, Yanyun Qu","doi":"10.1145/3503161.3547987","DOIUrl":"https://doi.org/10.1145/3503161.3547987","url":null,"abstract":"2D-3D unsupervised domain adaptation (UDA) tackles the lack of annotations in a new domain by capitalizing the relationship between 2D and 3D data. Existing methods achieve considerable improvements by performing cross-modality alignment in a modality-agnostic way, failing to exploit modality-specific characteristic for modeling complementarity. In this paper, we present self-supervised exclusive learning for cross-modal semantic segmentation under the UDA scenario, which avoids the prohibitive annotation. Specifically, two self-supervised tasks are designed, named \"plane-to-spatial'' and \"discrete-to-textured''. The former helps the 2D network branch improve the perception of spatial metrics, and the latter supplements structured texture information for the 3D network branch. In this way, modality-specific exclusive information can be effectively learned, and the complementarity of multi-modality is strengthened, resulting in a robust network to different domains. With the help of the self-supervised tasks supervision, we introduce a mixed domain to enhance the perception of the target domain by mixing the patches of the source and target domain samples. Besides, we propose a domain-category adversarial learning with category-wise discriminators by constructing the category prototypes for learning domain-invariant features. We evaluate our method on various multi-modality domain adaptation settings, where our results significantly outperform both uni-modality and multi-modality state-of-the-art competitors.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131277743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Single-image 3D human reconstruction aims to reconstruct the 3D textured surface of the human body from a single image. While implicit function-based methods have recently achieved reasonable reconstruction performance, they still suffer from degraded surface geometry and texture quality when viewed from unobserved directions. In response, to generate a realistic textured surface, we propose ReFu, a coarse-to-fine approach that refines the projected backside-view image and fuses the refined image to predict the final human body. To suppress the diffused occupancy that causes noise in projection images and reconstructed meshes, we propose to train occupancy probabilities with occupancy-based volume rendering, utilizing 2D and 3D supervision simultaneously. We also introduce a refinement architecture that generates detail-preserving backside-view images with front-to-back warping. Extensive experiments demonstrate that our method achieves state-of-the-art performance in 3D human reconstruction from a single image, with enhanced geometry and texture quality for unobserved views.
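A rough sketch of occupancy-based volume rendering (an assumed, standard alpha-compositing formulation; the authors' exact losses and sampling are not reproduced): per-sample occupancy probabilities along a ray are composited into a pixel color and a silhouette value, which can then receive 2D supervision.

```python
import torch

def render_ray(occupancy, colors):
    """
    occupancy: (N,) probabilities in [0, 1] for N samples ordered front-to-back.
    colors:    (N, 3) predicted colors at those samples.
    """
    transmittance = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - occupancy[:-1]]), dim=0)   # prob. the ray survived so far
    weights = transmittance * occupancy                            # contribution of each sample
    rgb = (weights.unsqueeze(-1) * colors).sum(dim=0)              # rendered pixel color
    alpha = weights.sum()                                          # rendered silhouette / mask value
    return rgb, alpha
```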
{"title":"ReFu: Refine and Fuse the Unobserved View for Detail-Preserving Single-Image 3D Human Reconstruction","authors":"Gyumin Shim, M. Lee, J. Choo","doi":"10.1145/3503161.3547971","DOIUrl":"https://doi.org/10.1145/3503161.3547971","url":null,"abstract":"Single-image 3D human reconstruction aims to reconstruct the 3D textured surface of the human body given a single image. While implicit function-based methods recently achieved reasonable reconstruction performance, they still bear limitations showing degraded quality in both surface geometry and texture from an unobserved view. In response, to generate a realistic textured surface, we propose ReFu, a coarse-to-fine approach that refines the projected backside view image and fuses the refined image to predict the final human body. To suppress the diffused occupancy that causes noise in projection images and reconstructed meshes, we propose to train occupancy probability by simultaneously utilizing 2D and 3D supervisions with occupancy-based volume rendering. We also introduce a refinement architecture that generates detail-preserving backside-view images with front-to-back warping. Extensive experiments demonstrate that our method achieves state-of-the-art performance in 3D human reconstruction from a single image, showing enhanced geometry and texture quality from an unobserved view.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133767509","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pan-sharpening aims to generate a high-spatial-resolution multi-spectral (MS) image by fusing a high-spatial-resolution panchromatic (PAN) image with its corresponding low-spatial-resolution MS image. Despite remarkable progress, most existing pan-sharpening methods operate only in the spatial domain and rarely explore potential solutions in the frequency domain. In this paper, we propose a novel pan-sharpening framework that adaptively learns low-high frequency information integration in the spatial and frequency dual domains. It consists of three key designs: a mask prediction sub-network, a low-frequency learning sub-network, and a high-frequency learning sub-network. Specifically, the first measures the modality-aware frequency information difference between the PAN and MS images and predicts the low-high frequency boundary in the form of a two-dimensional mask. Given the mask, the second adaptively picks out the corresponding low-frequency components of the different modalities and restores the expected low-frequency content by integrating information from the spatial and frequency domains, while the third combines the refined low-frequency content with the original high-frequency content for latent high-frequency reconstruction. In this way, low-high frequency information is adaptively learned, leading to pleasing results. Extensive experiments validate the effectiveness of the proposed network and demonstrate favorable performance against other state-of-the-art methods. The source code will be released at https://github.com/manman1995/pansharpening.
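A hedged sketch of frequency decomposition with a predicted two-dimensional mask (hypothetical; the paper's sub-networks are not reproduced here): the mask selects low-frequency components in the Fourier domain and the remainder is treated as the high-frequency residual.

```python
import torch

def split_frequencies(img, mask):
    """
    img:  (B, C, H, W) real-valued image tensor.
    mask: (B, 1, H, W) values in [0, 1] over the (unshifted) frequency grid,
          close to 1 where frequencies count as "low".
    """
    spec = torch.fft.fft2(img, norm="ortho")
    low = torch.fft.ifft2(spec * mask, norm="ortho").real   # masked low-frequency content
    high = img - low                                         # complementary high-frequency residual
    return low, high
```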
{"title":"Adaptively Learning Low-high Frequency Information Integration for Pan-sharpening","authors":"Man Zhou, Jie Huang, Chongyi Li, Huan Yu, Keyu Yan, Naishan Zheng, Fengmei Zhao","doi":"10.1145/3503161.3547924","DOIUrl":"https://doi.org/10.1145/3503161.3547924","url":null,"abstract":"Pan-sharpening aims to generate high-spatial resolution multi-spectral (MS) image by fusing high-spatial resolution panchromatic (PAN) image and its corresponding low-spatial resolution MS image. Despite the remarkable progress, most existing pan-sharpening methods only work in the spatial domain and rarely explore the potential solutions in the frequency domain. In this paper, we propose a novel pan-sharpening framework by adaptively learning low-high frequency information integration in the spatial and frequency dual domains. It consists of three key designs: mask prediction sub-network, low-frequency learning sub-network and high-frequency learning sub-network. Specifically, the first is responsible for measuring the modality-aware frequency information difference of PAN and MS images and further predicting the low-high frequency boundary in the form of a two-dimensional mask. In view of the mask, the second adaptively picks out the corresponding low-frequency components of different modalities and then restores the expected low-frequency one by spatial and frequency dual domains information integration while the third combines the above refined low-frequency and the original high-frequency for the latent high-frequency reconstruction. In this way, the low-high frequency information is adaptively learned, thus leading to the pleasing results. Extensive experiments validate the effectiveness of the proposed network and demonstrate the favorable performance against other state-of-the-art methods. The source code will be released at https://github.com/manman1995/pansharpening.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"164 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115407652","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-label image classification, whose methods can be categorized into label-dependency and region-based approaches, is a challenging problem due to complex underlying object layouts. Although region-based methods are less prone to generalization issues than label-dependency methods, they often generate hundreds of meaningless or noisy proposals with non-discriminative information, and the contextual dependency among the localized regions is often ignored or over-simplified. This paper builds a unified framework that performs effective noisy-proposal suppression and enables interaction between global and local features for robust feature learning. Specifically, we propose category-aware weak supervision that concentrates on non-existent categories so as to provide deterministic information for local feature learning, restricting the local branch to focus on higher-quality regions of interest. Moreover, we develop a cross-granularity attention module to explore the complementary information between global and local features, which builds high-order feature correlations containing not only global-to-local but also local-to-local relations. Both designs boost the performance of the whole network. Extensive experiments on two large-scale datasets (MS-COCO and VOC 2007) demonstrate that our framework achieves superior performance over state-of-the-art methods.
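A minimal sketch of one possible form of category-aware weak supervision (an assumed loss, not the paper's exact formulation): for categories absent from the image-level label, local-region scores are pushed toward zero, discouraging noisy proposals.

```python
import torch

def category_aware_weak_loss(region_logits, image_labels):
    """
    region_logits: (B, R, K) scores of R local regions over K categories.
    image_labels:  (B, K) multi-hot ground truth for the whole image.
    """
    probs = torch.sigmoid(region_logits)
    absent = (1.0 - image_labels).unsqueeze(1)   # (B, 1, K), 1 for non-existent classes
    # penalize any region responding to a category that is not present in the image
    return (probs * absent).mean()
```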
{"title":"Global Meets Local: Effective Multi-Label Image Classification via Category-Aware Weak Supervision","authors":"Jiawei Zhan, J. Liu, Wei Tang, Guannan Jiang, Xi Wang, Bin-Bin Gao, Tianliang Zhang, Wenlong Wu, Wei Zhang, Chengjie Wang, Yuan Xie","doi":"10.1145/3503161.3547834","DOIUrl":"https://doi.org/10.1145/3503161.3547834","url":null,"abstract":"Multi-label image classification, which can be categorized into label-dependency and region-based methods, is a challenging problem due to the complex underlying object layouts. Although region-based methods are less likely to encounter issues with model generalizability than label-dependency methods, they often generate hundreds of meaningless or noisy proposals with non-discriminative information, and the contextual dependency among the localized regions is often ignored or over-simplified. This paper builds a unified framework to perform effective noisy-proposal suppression and to interact between global and local features for robust feature learning. Specifically, we propose category-aware weak supervision to concentrate on non-existent categories so as to provide deterministic information for local feature learning, restricting the local branch to focus on more high-quality regions of interest. Moreover, we develop a cross-granularity attention module to explore the complementary information between global and local features, which can build the high-order feature correlation containing not only global-to-local, but also local-to-local relations. Both advantages guarantee a boost in the performance of the whole network. Extensive experiments on two large-scale datasets (MS-COCO and VOC 2007) demonstrate that our framework achieves superior performance over state-of-the-art methods.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124286627","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Training tuple construction is a crucial step in unsupervised local descriptor learning. Existing approaches perform this step relying on heuristics, which suffer from inaccurate supervision signals and struggle to achieve the desired performance. To address the problem, this work presents DescPro, an unsupervised approach that progressively explores both accurate and informative training tuples for model optimization without using heuristics. Specifically, DescPro consists of a Robust Cluster Assignment (RCA) method to infer pairwise relationships by clustering reliable samples with the increasingly powerful CNN model, and a Similarity-weighted Positive Sampling (SPS) strategy to select informative positive pairs for training tuple construction. Extensive experimental results show that, with the collaboration of the above two modules, DescPro can outperform state-of-the-art unsupervised local descriptors and even rival competitive supervised ones on standard benchmarks.
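A hedged sketch of similarity-weighted positive sampling (the exact weighting in DescPro may differ; names here are hypothetical): within each cluster produced by the assignment step, positive pairs are drawn with probability that favors less-similar, more informative pairs.

```python
import torch
import torch.nn.functional as F

def sample_positive_pairs(descriptors, cluster_ids, num_pairs=256):
    """descriptors: (N, D) features; cluster_ids: (N,) integer cluster assignments."""
    descriptors = F.normalize(descriptors, dim=1)
    pairs, weights = [], []
    for c in cluster_ids.unique():
        idx = (cluster_ids == c).nonzero(as_tuple=True)[0]
        if len(idx) < 2:
            continue
        sim = descriptors[idx] @ descriptors[idx].t()
        i, j = torch.triu_indices(len(idx), len(idx), offset=1)   # all in-cluster pairs
        pairs.append(torch.stack([idx[i], idx[j]], dim=1))
        weights.append(1.0 - sim[i, j])            # harder (less similar) pairs weigh more
    pairs = torch.cat(pairs)
    weights = torch.cat(weights).clamp(min=1e-6)
    pick = torch.multinomial(weights, min(num_pairs, len(weights)), replacement=False)
    return pairs[pick]                              # (num_pairs, 2) index pairs
```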
{"title":"Progressive Unsupervised Learning of Local Descriptors","authors":"Wu‐ru Wang, Lei Zhang, Hua Huang","doi":"10.1145/3503161.3547792","DOIUrl":"https://doi.org/10.1145/3503161.3547792","url":null,"abstract":"Training tuple construction is a crucial step in unsupervised local descriptor learning. Existing approaches perform this step relying on heuristics, which suffer from inaccurate supervision signals and struggle to achieve the desired performance. To address the problem, this work presents DescPro, an unsupervised approach that progressively explores both accurate and informative training tuples for model optimization without using heuristics. Specifically, DescPro consists of a Robust Cluster Assignment (RCA) method to infer pairwise relationships by clustering reliable samples with the increasingly powerful CNN model, and a Similarity-weighted Positive Sampling (SPS) strategy to select informative positive pairs for training tuple construction. Extensive experimental results show that, with the collaboration of the above two modules, DescPro can outperform state-of-the-art unsupervised local descriptors and even rival competitive supervised ones on standard benchmarks.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"340 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124309178","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Existing object detection models can successfully discriminate and localize predefined object categories under seen or similar situations. However, open-world object detection, as required by autonomous driving perception systems, involves recognizing unseen objects under various scenarios. On the one hand, the knowledge gap between seen and unseen object categories poses extreme challenges for models trained with supervision only from the seen categories. On the other hand, domain differences across scenarios make it necessary to take the domain gap into consideration by aligning the sample or label distribution. To resolve these two challenges simultaneously, we first design a pre-training model that formulates the mappings between visual images and semantic embeddings derived from extra annotations, linking the seen and unseen object categories in a self-supervised manner. Within this formulation, domain adaptation is then utilized to extract domain-agnostic feature representations and alleviate the misdetection of unseen objects caused by changes in domain appearance. As a result, the more realistic and practical open-world object detection problem is addressed by our novel formulation, which detects unseen categories from unseen domains without any bounding box annotations while incurring no obvious performance drop on the seen categories. We are the first to formulate a unified model for the open-world task and establish new state-of-the-art performance for this challenge.
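A simplified sketch of classification via vision-semantic alignment, the general mechanism the abstract describes (hypothetical names; not the authors' implementation): region features are projected into a word-embedding space and a detection is labeled by its nearest class embedding, so categories unseen at training time can still be named if their embeddings are available.

```python
import torch
import torch.nn.functional as F

def classify_by_semantics(region_feats, class_embeddings, class_names, tau=0.07):
    """
    region_feats:     (R, D) projected visual features of region proposals.
    class_embeddings: (K, D) semantic embeddings covering seen and unseen classes.
    class_names:      list of K category names aligned with class_embeddings.
    """
    sim = F.normalize(region_feats, dim=1) @ F.normalize(class_embeddings, dim=1).t()
    probs = F.softmax(sim / tau, dim=1)             # temperature-scaled similarity
    scores, idx = probs.max(dim=1)
    return [(class_names[i], float(s)) for i, s in zip(idx.tolist(), scores)]
```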
{"title":"Rethinking Open-World Object Detection in Autonomous Driving Scenarios","authors":"Zeyu Ma, Yang Yang, Guoqing Wang, Xing Xu, Heng Tao Shen, Mingxing Zhang","doi":"10.1145/3503161.3548165","DOIUrl":"https://doi.org/10.1145/3503161.3548165","url":null,"abstract":"Existing object detection models have been demonstrated to successfully discriminate and localize the predefined object categories under the seen or similar situations. However, the open-world object detection as required by autonomous driving perception systems refers to recognizing unseen objects under various scenarios. On the one hand, the knowledge gap between seen and unseen object categories poses extreme challenges for models trained with supervision only from the seen object categories. On the other hand, the domain differences across different scenarios also cause an additional urge to take the domain gap into consideration by aligning the sample or label distribution. Aimed at resolving these two challenges simultaneously, we firstly design a pre-training model to formulate the mappings between visual images and semantic embeddings from the extra annotations as guidance to link the seen and unseen object categories through a self-supervised manner. Within this formulation, the domain adaptation is then utilized for extracting the domain-agnostic feature representations and alleviating the misdetection of unseen objects caused by the domain appearance changes. As a result, the more realistic and practical open-world object detection problem is visited and resolved by our novel formulation, which could detect the unseen categories from unseen domains without any bounding box annotations while there is no obvious performance drop in detecting the seen categories. We are the first to formulate a unified model for open-world task and establish a new state-of-the-art performance for this challenge.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114350839","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}