Existing non-rigid shape matching methods suffer from two main disadvantages: (a) the local details and global features of shapes cannot be thoroughly explored, and (b) a satisfactory trade-off between matching accuracy and computational efficiency can hardly be achieved. To address these issues, we propose a local-global commutative preserving functional map (LGCP) for shape correspondence. The core of LGCP consists of an intra-segment geometric submodel and a local-global commutative preserving submodel, which accomplish the segment-to-segment matching and the point-to-point matching tasks, respectively. The first submodel comprises an ICP similarity term and two geometric similarity terms that guarantee the correct correspondence of segments between two shapes, while the second submodel guarantees the bijectivity of the correspondence at both the shape level and the segment level. Experimental results on both segment-to-segment matching and point-to-point matching show that LGCP not only generates quite accurate matching results, but also exhibits satisfactory portability and high efficiency.
{"title":"A Local-Global Commutative Preserving Functional Map for Shape Correspondence","authors":"Qianxing Li, Shaofan Wang, Dehui Kong, Baocai Yin","doi":"10.1145/3469877.3490593","DOIUrl":"https://doi.org/10.1145/3469877.3490593","url":null,"abstract":"Existing non-rigid shape matching methods mainly involve two disadvantages. (a) Local details and global features of shapes can not be carefully explored. (b) A satisfactory trade-off between the matching accuracy and computational efficiency can be hardly achieved. To address these issues, we propose a local-global commutative preserving functional map (LGCP) for shape correspondence. The core of LGCP involves an intra-segment geometric submodel and a local-global commutative preserving submodel, which accomplishes the segment-to-segment matching and the point-to-point matching tasks, respectively. The first submodel consists of an ICP similarity term and two geometric similarity terms which guarantee the correct correspondence of segments of two shapes, while the second submodel guarantees the bijectivity of the correspondence on both the shape level and the segment level. Experimental results on both segment-to-segment matching and point-to-point matching show that, LGCP not only generate quite accurate matching results, but also exhibit a satisfactory portability and a high efficiency.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127294868","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tensor robust principal component analysis (TRPCA) is an important algorithm for color image denoising that treats the whole image as a tensor and shrinks all singular values equally. In this paper, to improve the denoising performance of TRPCA, we propose a variant of the TRPCA model. Specifically, we first introduce a nonconvex TRPCA (N-TRPCA) model which shrinks large singular values less and small singular values more, so that the physical meanings of different singular values can be preserved. To take advantage of the structural redundancy of an image, we further group similar patches into a tensor according to a nonlocal prior, and then apply the N-TRPCA model to this tensor. The denoised image is obtained by aggregating all processed tensors. Experimental results demonstrate the superiority of the proposed denoising method over state-of-the-art methods.
{"title":"Color Image Denoising via Tensor Robust PCA with Nonconvex and Nonlocal Regularization","authors":"Xiaoyu Geng, Q. Guo, Cai-ming Zhang","doi":"10.1145/3469877.3493592","DOIUrl":"https://doi.org/10.1145/3469877.3493592","url":null,"abstract":"Tensor robust principal component analysis (TRPCA) is an important algorithm for color image denoising by treating the whole image as a tensor and shrinking all singular values equally. In this paper, to improve the denoising performance of TRPCA, we propose a variant of TRPCA model. Specifically, we first introduce a nonconvex TRPCA (N-TRPCA) model which can shrink large singular values more and shrink small singular values less, so that the physical meanings of different singular values can be preserved. To take advantage of the structural redundancy of an image, we further group similar patches as a tensor according to nonlocal prior, and then apply the N-TRPCA model on this tensor. The denoised image can be obtained by aggregating all processed tensors. Experimental results demonstrate the superiority of the proposed denoising method beyond state-of-the-arts.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129796591","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The video understanding capability of video recognition models has been significantly improved by the development of deep learning techniques and the availability of various video datasets. However, video recognition models are still vulnerable to imperceptible perturbations, which limits the use of deep video recognition models in the real world. We present a new benchmark for the robustness of action recognition classifiers to general corruptions, and show that a supervised contrastive learning framework is effective in obtaining discriminative and stable video representations, making deep video recognition models robust to general input corruptions. Experiments on the action recognition task for corrupted videos show the high robustness of the proposed method on the UCF101 and HMDB51 datasets under various common corruptions.
{"title":"Making Video Recognition Models Robust to Common Corruptions With Supervised Contrastive Learning","authors":"Tomu Hirata, Yusuke Mukuta, Tatsuya Harada","doi":"10.1145/3469877.3497692","DOIUrl":"https://doi.org/10.1145/3469877.3497692","url":null,"abstract":"The video understanding capability of video recognition models has been significantly improved by the development of deep learning techniques and various video datasets available. However, video recognition models are still vulnerable to invisible perturbations, which limits the use of deep video recognition models in the real world. We present a new benchmark for the robustness of action recognition classifiers to general corruptions, and show that a supervised contrastive learning framework is effective in obtaining discriminative and stable video representations, and makes deep video recognition models robust to general input corruptions. Experiments on the action recognition task for corrupted videos show the high robustness of the proposed method on the UCF101 and HMDB51 datasets with various common corruptions.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121820972","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multimodal social event detection has attracted tremendous research attention in recent years, because it provides a comprehensive and complementary understanding of social events and is important to public security and administration. Most existing works focus on the fusion of multimodal information, especially the fusion of a single image with text. Such single image-text pair processing breaks the correlations between images of the same post and may affect the accuracy of event detection. In this work, we propose to focus attention across multiple images for multimodal event detection, which is also more reasonable for tweets with short text and multiple images. Towards this end, we elaborate a novel Multi-Image Focusing Network (MIFN) to connect text content with visual aspects in multiple images. Our MIFN consists of a feature extractor, a multi-focal network and an event classifier. The multi-focal network implements focal attention across all the images, and fuses the regions most related to the text into a multimodal representation. The event classifier finally predicts the social event class based on the multimodal representation. To evaluate the effectiveness of our proposed approach, we conduct extensive experiments on a commonly-used disaster dataset. The experimental results demonstrate that, on both the humanitarian event detection task and its hurricane disaster variant, the proposed MIFN outperforms all the baselines. The ablation studies also demonstrate its ability to filter out irrelevant regions across images, which improves the accuracy of multimodal event detection.
{"title":"Focusing Attention across Multiple Images for Multimodal Event Detection","authors":"Yangyang Li, Jun Li, Hao Jin, Liang Peng","doi":"10.1145/3469877.3495642","DOIUrl":"https://doi.org/10.1145/3469877.3495642","url":null,"abstract":"Multimodal social event detection has been attracting tremendous research attention in recent years, due to that it provides comprehensive and complementary understanding of social events and is important to public security and administration. Most existing works have been focusing on the fusion of multimodal information, especially for single image and text fusion. Such single image-text pair processing breaks the correlations between images of the same post and may affect the accuracy of event detection. In this work, we propose to focus attention across multiple images for multimodal event detection, which is also more reasonable for tweets with short text and multiple images. Towards this end, we elaborate a novel Multi-Image Focusing Network (MIFN) to connect text content with visual aspects in multiple images. Our MIFN consists of a feature extractor, a multi-focal network and an event classifier. The multi-focal network implements a focal attention across all the images, and fuses the most related regions with texts as multimodal representation. The event classifier finally predict the social event class based on the multimodal representations. To evaluate the effectiveness of our proposed approach, we conduct extensive experiments on a commonly-used disaster dataset. The experimental results demonstrate that, in both humanitarian event detection task and its variant of hurricane disaster, the proposed MIFN outperforms all the baselines. The ablation studies also exhibit the ability to filter the irrelevant regions across images which results in improving the accuracy of multimodal event detection.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121994578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Image paragraph captioning, the task of generating a paragraph description for a given image, usually requires mining and organizing linguistic counterparts from abundant visual clues. Limited by the sequential decoding perspective, previous methods have difficulty organizing the visual clues holistically or capturing the structural nature of linguistic descriptions. In this paper, we propose a novel tree-structured visual paragraph decoder network, called Splitting to Tree Decoder (S2TD), to address this problem. The key idea is to model the paragraph decoding process as a top-down binary tree expansion. S2TD consists of three modules: a split module, a score module, and a word-level RNN. The split module iteratively splits ancestral visual representations into two parts through a gating mechanism. To determine the tree topology, the score module uses cosine similarity to evaluate node splitting. A novel tree-structure loss is proposed to enable end-to-end learning. After the tree expansion, the word-level RNN decodes leaf nodes into sentences that form a coherent paragraph. Extensive experiments are conducted on the Stanford benchmark dataset. The experimental results show the promising performance of our proposed S2TD.
{"title":"S2TD: A Tree-Structured Decoder for Image Paragraph Captioning","authors":"Yihui Shi, Yun Liu, Fangxiang Feng, Ruifan Li, Zhanyu Ma, Xiaojie Wang","doi":"10.1145/3469877.3490585","DOIUrl":"https://doi.org/10.1145/3469877.3490585","url":null,"abstract":"Image paragraph captioning, a task to generate the paragraph description for a given image, usually requires mining and organizing linguistic counterparts from abundant visual clues. Limited by sequential decoding perspective, previous methods have difficulty in organizing the visual clues holistically or capturing the structural nature of linguistic descriptions. In this paper, we propose a novel tree-structured visual paragraph decoder network, called Splitting to Tree Decoder (S2TD) to address this problem. The key idea is to model the paragraph decoding process as a top-down binary tree expansion. S2TD consists of three modules: a split module, a score module, and a word-level RNN. The split module iteratively splits ancestral visual representations into two parts through a gating mechanism. To determine the tree topology, the score module uses cosine similarity to evaluate the nodes splitting. A novel tree structure loss is proposed to enable end-to-end learning. After the tree expansion, the word-level RNN decodes leaf nodes into sentences forming a coherent paragraph. Extensive experiments are conducted on the Stanford benchmark dataset. The experimental results show promising performance of our proposed S2TD.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128338125","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Activity recognition based on egocentric multimodal data collected by wearable devices has become increasingly popular in recent years. However, conventional activity recognition methods face the lack of large-scale labeled egocentric multimodal datasets due to the high cost of data collection. In this paper, we propose a new task of few-shot egocentric multimodal activity recognition, which poses at least two significant challenges. On the one hand, it is difficult to extract effective features from the multimodal sequences of video and sensor signals due to the scarcity of samples. On the other hand, robustly recognizing novel activity classes with very few labeled samples is an even more critical challenge due to the complexity of the multimodal data. To resolve these challenges, we propose a two-stream graph network, which consists of a heterogeneous graph-based multimodal association module and a knowledge-aware activity classifier module. The former uses a heterogeneous graph network to comprehensively capture the dynamic and complementary information contained in the multimodal data stream. The latter learns robust activity classifiers through knowledge propagation among the classifier parameters of different classes. In addition, we adopt an episodic training strategy to improve the generalization ability of the proposed few-shot activity recognition model. Experiments on two public datasets show that the proposed model achieves better performance than other baseline models.
{"title":"Few-shot Egocentric Multimodal Activity Recognition","authors":"Jinxing Pan, Xiaoshan Yang, Yi Huang, Changsheng Xu","doi":"10.1145/3469877.3490603","DOIUrl":"https://doi.org/10.1145/3469877.3490603","url":null,"abstract":"Activity recognition based on egocentric multimodal data collected by wearable devices has become increasingly popular recently. However, conventional activity recognition methods face the dilemma of the lack of large-scale labeled egocentric multimodal datasets due to the high cost of data collection. In this paper, we propose a new task of few-shot egocentric multimodal activity recognition, which has at least two significant challenges. On the one hand, it is difficult to extract effective features from the multimodal data sequences of video and sensor signals due to the scarcity of the samples. On the other hand, how to robustly recognize novel activity classes with very few labeled samples becomes another more critical challenge due to the complexity of the multimodal data. To resolve the challenges, we propose a two-stream graph network, which consists of a heterogeneous graph-based multimodal association module and a knowledge-aware activity classifier module. The former uses a heterogeneous graph network to comprehensively capture the dynamic and complementary information contained in the multimodal data stream. The latter learns robust activity classifiers through knowledge propagation among the classifier parameters of different classes. In addition, we adopt episodic training strategy to improve the generalization ability of the proposed few-shot activity recognition model. Experiments on two public datasets show that the proposed model achieves better performances than other baseline models.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"75 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134155535","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hashing can compress heterogeneous high-dimensional data into compact binary codes. Most existing hash methods first predetermine a fixed length for the hash code and then train the model based on this fixed length. However, when the task requirements change, these methods need to retrain the model for a new code length, which increases the time cost. To address this issue, we propose a deep supervised hashing method, called deep multiple length hashing (DMLH), which can learn hash codes of multiple lengths simultaneously based on a multi-task learning network. DMLH effectively exploits the relationships among codes of different lengths through a hard-parameter-sharing multi-task network. Specifically, in DMLH, the hash codes of different lengths are regarded as different views of the same sample. Furthermore, we introduce a mutual information loss to mine the association among hash codes of different lengths. Extensive experiments indicate that DMLH outperforms most existing models, verifying its effectiveness.
{"title":"Deep Multiple Length Hashing via Multi-task Learning","authors":"Letian Wang, Xiushan Nie, Quan Zhou, Yang Shi, Xingbo Liu","doi":"10.1145/3469877.3493591","DOIUrl":"https://doi.org/10.1145/3469877.3493591","url":null,"abstract":"Hashing can compress heterogeneous high-dimensional data into compact binary codes. For most existing hash methods, they first predetermine a fixed length for the hash code and then train the model based on this fixed length. However, when the task requirements change, these methods need to retrain the model for a new length of hash codes, which increases time cost. To address this issue, we propose a deep supervised hashing method, called deep multiple length hashing(DMLH), which can learn multiple length hash codes simultaneously based on a multi-task learning network. This proposed DMLH can well utilize the relationships with a hard parameter sharing-based multi-task network. Specifically, in DMLH, the multiple hash codes with different lengths are regarded as different views of the same sample. Furthermore, we introduce a type of mutual information loss to mine the association among hash codes of different lengths. Extensive experiments have indicated that DMLH outperforms most existing models, verifying its effectiveness.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133121924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The goal of Zero-shot Learning (ZSL) is to recognize categories that are not seen during the training process. The traditional approach is to learn an embedding space and map visual features and semantic features into this common space. However, this approach inevitably encounters the bias problem, i.e., unseen instances are often incorrectly recognized as seen classes. Another paradigm instead uses generative models to hallucinate the features of unseen samples. However, generative models often suffer from instability issues, making it impractical for them to generate fine-grained features of unseen samples and thus yielding very limited improvement. To resolve this, we propose a Semantic Enhanced Cross-modal GAN (SECM GAN), which imposes a cross-modal association to improve the semantic and discriminative properties of the generated features. Specifically, we first train a cross-modal embedding model called the Semantic Enhanced Cross-modal Model (SECM), which is constrained by discrimination and semantics. We then train our generative model, SECM GAN, based on a Generative Adversarial Network (GAN), in which the generator generates cross-modal features and the discriminator distinguishes true cross-modal features from generated ones. We deploy SECM as a weak constraint on the GAN, which reduces the reliance on the GAN. We conduct extensive experiments on three widely used ZSL datasets to demonstrate the superiority of our framework.
{"title":"Semantic Enhanced Cross-modal GAN for Zero-shot Learning","authors":"Haotian Sun, Jiwei Wei, Yang Yang, Xing Xu","doi":"10.1145/3469877.3490581","DOIUrl":"https://doi.org/10.1145/3469877.3490581","url":null,"abstract":"The goal of Zero-shot Learning (ZSL) is to recognize categories that are not seen during the training process. The traditional method is to learn an embedding space and map visual features and semantic features to this common space. However, this method inevitably encounters the bias problem, i.e., unseen instances are often incorrectly recognized as the seen classes. Some attempts are made by proposing another paradigm, which uses generative models to hallucinate the features of unseen samples. However, the generative models often suffer from instability issues, making it impractical for them to generate fine-grained features of unseen samples, thus resulting in very limited improvement. To resolve this, a Semantic Enhanced Cross-modal GAN (SECM GAN) is proposed by imposing the cross-modal association for improving the semantic and discriminative property of the generated features. Specifically, we first train a cross-modal embedding model called Semantic Enhanced Cross-modal Model (SECM), which is constrained by discrimination and semantics. Then we train our generative model based on Generative Adversarial Network (GAN) called SECM GAN, in which the generator generates cross-modal features, and the discriminator distinguishes true cross-modal features from generated cross-modal features. We deploy SECM as a weak constraint of GAN, which makes reliance on GAN get reduced. We evaluate extensive experiments on three widely used ZSL datasets to demonstrate the superiority of our framework.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134226502","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The various data and privacy regulations introduced around the globe require data to be stored in a secure and privacy-preserving fashion, and non-compliance with these regulations comes with major consequences. This has led to the formation of huge data silos within organizations, making data analysis difficult and increasing the risk of a data breach. Isolating data also prevents collaborative research. To address this, we present Private-Share, a framework that enables secure sharing of large-scale data. To achieve this goal, Private-Share leverages recent advances in blockchain technology, specifically the InterPlanetary File System (IPFS) and Ethereum.
{"title":"Private-Share: A Secure and Privacy-Preserving De-Centralized Framework for Large Scale Data Sharing","authors":"Arun Zachariah, Maha M AlRasheed","doi":"10.1145/3469877.3493588","DOIUrl":"https://doi.org/10.1145/3469877.3493588","url":null,"abstract":"The various data and privacy regulations introduced around the globe, require data to be stored in a secure and privacy-preserving fashion. Non-compliance with these regulations come with major consequences. This has led to the formation of huge data silos within organizations leading to difficult data analysis along with an increased risk of a data breach. Isolating data also prevents collaborative research. To address this, we present Private-Share, a framework that would enable secure sharing of large scale data. In order to achieve this goal, Private-Share leverages the recent advances in blockchain technology specifically the InterPlanetary File System and Ethereum.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131839506","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Virtual try-on systems have become popular for visualizing outfits, due to the importance of individual fashion in many communities. The objective of such a system is to transfer a piece of clothing to another person while preserving its detail and characteristics. Generating a realistic in-the-wild image requires jointly optimizing the appearance of the clothing, the background, and the target person, which makes this task very challenging. In this paper, we develop a method that generates realistic try-on images from unpaired images in in-the-wild datasets. Our proposed method starts by generating a mock-up paired image using geometric transfer. Then, the target's pose information is adjusted using a modified pose-attention module. We combine a reconstruction loss and a content loss to preserve the detail and style of the transferred clothing, the background, and the target person. We evaluate the approach on the Fashionpedia dataset and show promising performance over a baseline approach.
{"title":"Pose-aware Outfit Transfer between Unpaired in-the-wild Fashion Images","authors":"Donnaphat Trakulwaranont, Marc A. Kastner, S. Satoh","doi":"10.1145/3469877.3490569","DOIUrl":"https://doi.org/10.1145/3469877.3490569","url":null,"abstract":"Virtual try-on systems became popular for visualizing outfits, due to the importance of individual fashion in many communities. The objective of such a system is to transfer a piece of clothing to another person while preserving its detail and characteristics. To generate a realistic in-the-wild image, it needs visual optimization of the clothing, background and target person, making this task still very challenging. In this paper, we develop a method that generates realistic try-on images with unpaired images from in-the-wild datasets. Our proposed method starts with generating a mock-up paired image using geometric transfer. Then, the target’s pose information is adjusted using a modified pose-attention module. We combine a reconstruction and a content loss to preserve the detail and style of the transferred clothing, background and the target person. We evaluate the approach on the Fashionpedia dataset and can show a promising performance over a baseline approach.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131313298","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}