Uncertainty-Aware Semi-Supervised Learning of 3D Face Rigging from Single Image
Yong Zhao, Haifeng Chen, H. Sahli, Ke Lu, D. Jiang
Proceedings of the 30th ACM International Conference on Multimedia, 2022. DOI: https://doi.org/10.1145/3503161.3548285

We present a method to rig 3D faces via Action Units (AUs), viewpoint, and light direction from a single input image. Existing 3D methods for face synthesis and animation rely heavily on the 3D morphable model (3DMM), which is built from 3D data and cannot provide intuitive expression parameters, while AU-driven 2D methods cannot handle head pose and lighting effects. We bridge the gap by integrating a recent 3D reconstruction method with a 2D AU-driven method in a semi-supervised fashion. Built upon an auto-encoding 3D face reconstruction model that decouples depth, albedo, viewpoint, and light without any supervision, we further decouple expression from identity for depth and albedo with a novel conditional feature translation module and pretrained critics for AU intensity estimation and image classification. Novel objective functions are designed using unlabeled in-the-wild images and indoor images with AU labels. We also leverage uncertainty losses to model the potentially changing AU regions of images as input noise for synthesis, and to model the noisy AU intensity labels for the intensity estimation of the AU critic. Experiments on face editing and animation on four datasets show that, compared with six state-of-the-art methods, our method is superior in terms of expression consistency, identity similarity, and pose similarity.
Compute to Tell the Tale: Goal-Driven Narrative Generation
Yongkang Wong, Shaojing Fan, Yangyang Guo, Ziwei Xu, Karen Stephen, Rishabh Sheoran, Anusha Bhamidipati, Vivek Barsopia, Jianquan Liu, Mohan S. Kankanhalli
Proceedings of the 30th ACM International Conference on Multimedia, 2022. DOI: https://doi.org/10.1145/3503161.3549202

Man is by nature a social animal. One important facet of human evolution is narrative imagination: conceiving a tale, be it fictional or factual, and telling it to other individuals. Factual narratives, such as news, journalism, and field reports, are based on real-world events and often require extensive human effort to create. In the era of big data, where video capture devices are available everywhere, a massive amount of raw video (including life-logging, dashcam, or surveillance footage) is generated daily. As a result, it is practically impossible for humans to digest and analyze all of this video data. This paper reviews the problem of computational narrative generation, where a goal-driven narrative (in the form of text, with or without video) is generated from one or multiple long videos. Importantly, the narrative generation problem is distinguished from the existing literature by its focus on a comprehensive understanding of the user goal, narrative structure, and open-domain input. We tentatively outline a general narrative generation framework and discuss potential research problems and challenges in this direction. Informed by the real-world impact of narrative generation, we then illustrate several practical use cases in a Video Logging as a Service platform, which enables users to get more out of their data through a goal-driven intelligent storytelling AI agent.
Self-Supervised Human Pose based Multi-Camera Video Synchronization
Liqiang Yin, Ruize Han, Wei Feng, Song Wang
Proceedings of the 30th ACM International Conference on Multimedia, 2022. DOI: https://doi.org/10.1145/3503161.3547766

Multi-view video collaborative analysis is an important task with many applications in the multimedia community. However, it requires the given multiple videos to be temporally synchronized. Existing methods commonly synchronize the videos via wired communication, which may hinder practical application in the real world, especially for moving cameras. In this paper, we focus on human-centric video analysis and propose a self-supervised framework for automatic multi-camera video synchronization. Specifically, we develop SeSyn-Net, which takes 2D human pose as input for feature embedding, and design a series of self-supervised losses to effectively extract view-invariant but time-discriminative representations for video synchronization. We also build two new datasets for performance evaluation. Extensive experimental results verify the effectiveness of our method, which achieves superior performance compared with both classical and state-of-the-art methods.
An Efficient Multi-View Multimodal Data Processing Framework for Social Media Popularity Prediction
Yunpeng Tan, Fang Liu, Bowei Li, Zheng Zhang, Bo Zhang
Proceedings of the 30th ACM International Conference on Multimedia, 2022. DOI: https://doi.org/10.1145/3503161.3551607

The popularity of social media content is an important indicator of its communication power, and predicting social media popularity has tremendous business and social value. In this paper, we propose an efficient multimodal data processing framework that comprehensively extracts multi-view features from multimodal social media data and achieves accurate popularity prediction. We utilize a Transformer and a sliding-window average to extract time-series features of posts, utilize CatBoost to calculate the importance of different features, and integrate the important features extracted from multiple views for accurate prediction of social media popularity. We evaluate our approach on the Social Media Prediction Dataset. Experimental results show that it achieves excellent performance on the social media popularity prediction task.
Rethinking Optical Flow Methods for Micro-Expression Spotting
Yuan Zhao, Xin Tong, Zichong Zhu, Jianda Sheng, Lei Dai, Lingling Xu, Xuehai Xia, Y. Jiang, Jiao Li
Proceedings of the 30th ACM International Conference on Multimedia, 2022. DOI: https://doi.org/10.1145/3503161.3551602

Micro-expression (ME) spotting has applications in fields such as criminal investigation and business communication, but accurately spotting the onset and offset of MEs in long videos remains a challenging task. This paper refines every step of the workflow before feature extraction, which reduces error propagation. The workflow takes advantage of a high-quality alignment method, a more accurate landmark detector, and more robust optical flow estimation. In addition, Bayesian optimization hybridized with Nash equilibrium is used to search for the optimal parameters: two players optimize two types of parameters, one controlling ME peak spotting and the other controlling optical flow field extraction. The algorithm reduces the search space for each player and generalizes better. Finally, our spotting method is evaluated on the MEGC2022 spotting task and achieves an F1-score of 0.3564 on CAS(ME)3-UNSEEN and 0.3265 on SAMM-UNSEEN.
CRNet: Unsupervised Color Retention Network for Blind Motion Deblurring
Suiyi Zhao, Zhao Zhang, Richang Hong, Mingliang Xu, Haijun Zhang, Meng Wang, Shuicheng Yan
Proceedings of the 30th ACM International Conference on Multimedia, 2022. DOI: https://doi.org/10.1145/3503161.3547962

Blind image deblurring is still a challenging problem due to its inherent ill-posedness. To improve deblurring performance, many supervised methods have been proposed. However, obtaining labeled samples from a specific distribution (or domain) is usually expensive, and a data-driven, training-based model cannot generalize to blurry images from all domains. These challenges have given birth to a number of unsupervised deblurring methods, which, however, suffer from great chromatic aberration between the latent and original images, directly degrading performance. In this paper, we therefore propose a novel unsupervised color retention network, termed CRNet, for blind motion deblurring. In addition, the new concepts of blur offset estimation and adaptive blur correction are proposed to retain color information when deblurring. As a result, unlike previous studies, CRNet does not learn a mapping directly from the blurry image to the restored latent image, but from the blurry image to a motion offset. An adaptive blur correction operation is then performed on the blurry image to restore the latent image, thereby retaining the color information of the original image to the greatest extent. To further retain color information and extract blur information, we also propose a new module called pyramid global blur feature perception (PGBFP). To quantitatively prove the effectiveness of our network in color retention, we propose a novel chromatic aberration quantization metric in line with human perception. Extensive quantitative and visualization experiments show that CRNet obtains state-of-the-art performance on unsupervised deblurring tasks.
Machine Unlearning for Image Retrieval: A Generative Scrubbing Approach
P. Zhang, Guangdong Bai, Zi Huang, Xin-Shun Xu
Proceedings of the 30th ACM International Conference on Multimedia, 2022. DOI: https://doi.org/10.1145/3503161.3548378

Data owners have the right to request the deletion of their data from a machine learning (ML) model. A naïve response is to retrain the model on the original dataset excluding the data to forget, which is, however, unrealistic, as the required dataset may no longer be available and the retraining process is usually computationally expensive. To cope with this reality, machine unlearning has recently attracted much attention; it aims to remove data from a trained ML model in response to deletion requests, without retraining the model from scratch or requiring full access to the original training dataset. Existing unlearning methods mainly focus on conventional ML methods, while unlearning deep neural network (DNN)-based models remains underexplored, especially for models trained on large-scale datasets. In this paper, we make the first attempt to realize data forgetting in deep models for image retrieval. Image retrieval aims to search for data relevant to a query according to similarity measures. Intuitively, unlearning a deep image retrieval model can be achieved by breaking down its ability to model similarity on the data to forget. To this end, we propose a generative scrubbing (GS) method that learns a generator to craft noisy data that manipulate the model weights. A novel framework is designed consisting of the generator and the target retrieval model, where a pair of coupled static and dynamic learning procedures are performed simultaneously. This learning strategy effectively enables the generated noisy data to erase the model's memory of the data to forget while retaining the information of the remaining data. Extensive experiments on three widely used datasets verify the effectiveness of the proposed method.
Talk2Face: A Unified Sequence-based Framework for Diverse Face Generation and Analysis Tasks
Yudong Li, Xianxu Hou, Zhe Zhao, Linlin Shen, Xuefeng Yang, Kimmo Yan
Proceedings of the 30th ACM International Conference on Multimedia, 2022. DOI: https://doi.org/10.1145/3503161.3548205

Facial analysis is an important domain in computer vision and has received extensive research attention. For numerous downstream tasks with different input/output formats and modalities, existing methods usually design task-specific architectures and train them on face datasets collected in the particular task domain. In this work, we propose a single model, Talk2Face, to simultaneously tackle a large number of face generation and analysis tasks, e.g., text-guided face synthesis, face captioning, and age estimation. Specifically, we cast different tasks into a sequence-to-sequence format with the same architecture, parameters, and objectives. While text and facial images are tokenized into sequences, the annotation labels of faces for different tasks are also converted into natural language for a unified representation. We collect 2.3M face-text pairs from available datasets across different tasks to train the proposed model. Uniform templates are then designed to enable the model to perform different downstream tasks according to the task context and target. Experiments on different tasks show that our model achieves better face generation and captioning performance than SOTA approaches. On age estimation and multi-attribute classification, our model reaches performance competitive with models specifically designed and trained for these particular tasks. In practice, our model is much easier to deploy for different facial analysis tasks. Code and dataset will be available at https://github.com/ydli-ai/Talk2Face.
PRO-Face: A Generic Framework for Privacy-preserving Recognizable Obfuscation of Face Images
Lin Yuan, Linguo Liu, Xiao Pu, Zhao Li, Hongbo Li, Xinbo Gao
Proceedings of the 30th ACM International Conference on Multimedia, 2022. DOI: https://doi.org/10.1145/3503161.3548202

A number of applications (e.g., video surveillance and authentication) rely on automated face recognition to guarantee the functioning of secure services and, at the same time, have to take into account the privacy of individuals exposed to camera systems. This is the so-called privacy-utility trade-off. However, most existing approaches to facial privacy protection focus on removing identifiable visual information from images, leaving the protected face unrecognizable to machines, which sacrifices utility for privacy. To tackle the privacy-utility challenge, we propose a novel, generic, effective, yet lightweight framework for Privacy-preserving Recognizable Obfuscation of Face images (named PRO-Face). The framework allows one to first process a face image with any preferred obfuscation, such as blurring, pixelation, or face morphing. It then leverages a Siamese network to fuse the original image with its obfuscated form, generating a final protected image that is visually similar to the obfuscated one to human perception (for privacy) but is still recognized as the original identity by machines (for utility). The framework supports various obfuscations for facial anonymization. Face recognition can be performed accurately not only across anonymized images but also between plain and anonymized ones, using only pre-trained recognizers. These properties constitute the "generic" merit of the proposed framework. In-depth objective and subjective evaluations demonstrate the effectiveness of the proposed framework in both privacy protection and utility preservation under distinct scenarios. Our source code, models, and supplementary materials are made publicly available.
A Probabilistic Model for Controlling Diversity and Accuracy of Ambiguous Medical Image Segmentation
Wei Zhang, Xiaohong Zhang, Sheng Huang, Yuting Lu, Kun Wang
Proceedings of the 30th ACM International Conference on Multimedia, 2022. DOI: https://doi.org/10.1145/3503161.3548115

Medical image segmentation tasks often have more than one plausible annotation for a given input image due to inherent ambiguity. Generating multiple plausible predictions for a single image is of interest for safety-critical medical applications. Many methods estimate the distribution of the annotation space by developing probabilistic models that generate multiple hypotheses. However, these methods tend to improve the diversity of predictions at the expense of the more important accuracy. In this paper, we propose a novel probabilistic segmentation model, called Joint Probabilistic U-net, which achieves flexible control over the two abstract notions of diversity and accuracy. Specifically, we (i) model the joint distribution of images and annotations to learn a latent space, which is used to decouple diversity and accuracy, and (ii) transform the Gaussian distribution in the latent space into a more complex distribution to improve the model's expressiveness. In addition, we explore two strategies for preventing latent space collapse, which are effective in improving the model's performance on datasets with limited annotations. We demonstrate the effectiveness of the proposed model on two medical image datasets, LIDC-IDRI and ISBI 2016, and achieve state-of-the-art results on several metrics.