Longhua Sun, Jin Wang, Yunhui Shi, Qing Zhu, Baocai Yin
High-quality depth information has been increasingly used in many real-world multimedia applications in recent years. Due to the limitations of depth sensors and sensing technology, however, captured depth maps usually have low resolution and contain holes. In this paper, inspired by the geometric relationship between the surface normals of a 3D scene and its distance from the camera, we observe that a surface normal map can provide additional spatial geometric constraints for depth map reconstruction, since a depth map is a special image carrying spatial information, which we call a 2.5D image. To exploit this property, we propose a novel surface normal data guided depth recovery method, which uses surface normal data and observed depth values to estimate missing or interpolated depth values. Moreover, to preserve the inherent piecewise-smooth characteristic of depth maps, a graph Laplacian prior is applied to regularize the inverse problem of depth map recovery, and a graph Laplacian regularizer (GLR) is proposed. Finally, the spatial geometric constraint and graph Laplacian regularization are integrated into a unified optimization framework, which can be solved efficiently by conjugate gradient (CG). Extensive quantitative and qualitative evaluations against state-of-the-art schemes show the effectiveness and superiority of our method.
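To make the GLR formulation concrete, here is a minimal numpy/scipy sketch of graph-Laplacian-regularized depth recovery solved by conjugate gradient, assuming a 4-connected pixel grid with uniform edge weights and omitting the surface-normal constraint; the function names and the simple data term are illustrative, not the authors' implementation.

```python
# Minimal sketch of graph-Laplacian-regularized depth recovery solved with
# conjugate gradient (CG).  Assumes a 4-neighbor grid graph with uniform edge
# weights; the paper's surface-normal constraint is omitted for brevity.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

def grid_laplacian(h, w):
    """Combinatorial Laplacian of an h x w 4-connected pixel grid."""
    n = h * w
    idx = np.arange(n).reshape(h, w)
    rows, cols = [], []
    for a, b in [(idx[:, :-1], idx[:, 1:]), (idx[:-1, :], idx[1:, :])]:
        rows.extend(a.ravel()); cols.extend(b.ravel())
    W = sp.coo_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
    W = W + W.T                       # symmetric adjacency
    D = sp.diags(np.asarray(W.sum(axis=1)).ravel())
    return (D - W).tocsr()

def recover_depth(d_obs, mask, lam=0.5):
    """Fill holes in d_obs (zero where mask == 0) by solving
       (M + lam * L) d = M d_obs with CG, where M selects observed pixels."""
    h, w = d_obs.shape
    L = grid_laplacian(h, w)
    m = mask.ravel().astype(float)
    A = sp.diags(m) + lam * L         # SPD for lam > 0 and non-empty mask
    b = m * np.nan_to_num(d_obs).ravel()
    d, _ = cg(A, b, atol=1e-8)
    return d.reshape(h, w)

# Tiny usage example: recover a missing center pixel in a 3x3 depth patch.
depth = np.array([[1.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]])
mask = (depth > 0).astype(float)
print(recover_depth(depth, mask))
```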
{"title":"Surface Normal Data Guided Depth Recovery with Graph Laplacian Regularization","authors":"Longhua Sun, Jin Wang, Yunhui Shi, Qing Zhu, Baocai Yin","doi":"10.1145/3338533.3366582","DOIUrl":"https://doi.org/10.1145/3338533.3366582","url":null,"abstract":"High-quality depth information has been increasingly used in many real-world multimedia applications in recent years. Due to the limitation of depth sensor and sensing technology, actually, the captured depth map usually has low resolution and black holes. In this paper, inspired by the geometric relationship between surface normal of a 3D scene and their distance from camera, we discover that surface normal map can provide more spatial geometric constraints for depth map reconstruction, as depth map is a special image with spatial information, which we called 2.5D image. To exploit this property, we propose a novel surface normal data guided depth recovery method, which uses surface normal data and observed depth value to estimate missing or interpolated depth values. Moreover, to preserve the inherent piecewise smooth characteristic of depth maps, graph Laplacian prior is applied to regularize the inverse problem of depth maps recovery and a graph Laplacian regularizer(GLR) is proposed. Finally, the spatial geometric constraint and graph Laplacian regularization are integrated into a unified optimization framework, which can be efficiently solved by conjugate gradient(CG). Extensive quantitative and qualitative evaluations compared with state-of-the-art schemes show the effectiveness and superiority of our method.","PeriodicalId":273086,"journal":{"name":"Proceedings of the ACM Multimedia Asia","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122328075","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Converting near-infrared (NIR) images into color images is a challenging task due to the different characteristics of visible and NIR images. Most methods that generate color images directly from a single NIR image are limited to particular scene and object categories. In this paper, we propose a novel approach to recovering object colors from multi-spectral NIR images using gray information. The multi-spectral NIR images are captured by a 2-CCD NIR/RGB camera with narrow NIR bandpass filters of different wavelengths. The proposed approach estimates a conversion matrix for NIR-to-RGB color conversion from the multi-spectral NIR images, using a corresponding gray image as a complementary channel. The conversion matrix is obtained from the ColorChecker's 24 color patches using polynomial regression and then applied to real-world scene NIR images for color recovery. The proposed approach has been evaluated on a large number of real-world scene images, and the results show that it is simple yet effective for recovering object colors.
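As an illustration of the regression step, the following is a minimal sketch of fitting an NIR-plus-gray-to-RGB mapping on the 24 ColorChecker patches; the second-order polynomial expansion, channel count, and function names are assumptions, not the authors' exact formulation.

```python
# Minimal sketch of fitting an NIR+gray -> RGB conversion matrix by polynomial
# regression on the 24 ColorChecker patches.  The polynomial order and channel
# layout are illustrative assumptions.
import numpy as np

def poly_expand(x):
    """Second-order polynomial expansion of per-patch measurements.
    x: (N, C) array of NIR band + gray values -> (N, F) design matrix."""
    n, c = x.shape
    cross = [x[:, i] * x[:, j] for i in range(c) for j in range(i, c)]
    return np.column_stack([np.ones(n), x] + cross)

def fit_conversion(nir_gray_patches, rgb_patches):
    """Least-squares fit of the conversion matrix M (F x 3)."""
    A = poly_expand(nir_gray_patches)
    M, *_ = np.linalg.lstsq(A, rgb_patches, rcond=None)
    return M

def apply_conversion(nir_gray_pixels, M):
    """Apply the fitted matrix to per-pixel measurements."""
    return poly_expand(nir_gray_pixels) @ M

# Usage with synthetic data: 24 patches, 3 NIR bands + 1 gray channel.
rng = np.random.default_rng(0)
patches = rng.random((24, 4))
rgb_ref = rng.random((24, 3))          # stand-in for ColorChecker RGB values
M = fit_conversion(patches, rgb_ref)
print(apply_conversion(patches[:2], M))
```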
{"title":"Color Recovery from Multi-Spectral NIR Images Using Gray Information","authors":"Qingtao Fu, Cheolkon Jung, Chen Su","doi":"10.1145/3338533.3368259","DOIUrl":"https://doi.org/10.1145/3338533.3368259","url":null,"abstract":"Converting near-infrared (NIR) images into color images is a challenging task due to the different characteristics of visible and NIR images. Most methods of generating color images directly from a single NIR image are limited by the scene and object categories. In this paper, we propose a novel approach to recovering object colors from multi-spectral NIR images using gray information. The multi-spectral NIR images are obtained by a 2-CCD NIR/RGB camera with narrow NIR bandpass filters of different wavelengths. The proposed approach is based on multi-spectral NIR images to estimate a conversion matrix for NIR to RGB conversion. In addition to the multi-spectral NIR images, a corresponding gray image is used as a complementary channel to estimate the conversion matrix for NIR to RGB color conversion. The conversion matrix is obtained from the ColorChecker's 24 color blocks using polynomial regression and applied to real-world scene NIR images for color recovery. The proposed approach has been evaluated by a large number of real-world scene images, and the results show that the proposed approach is simple yet effective for recovering color of objects.","PeriodicalId":273086,"journal":{"name":"Proceedings of the ACM Multimedia Asia","volume":"48 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130669564","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The leading learning-based rate control method, QARC, achieves state-of-the-art performance but does not expose its underlying principles, and thus is hard to improve further in a targeted way. In this paper, we propose EQARC (Explainable QARC) by reconstructing QARC's modules, aiming to demystify how QARC works. In detail, we first utilize a novel hybrid attention-based CNN+GRU model to re-characterize the original quality prediction network and replace QARC's 1D-CNN layers with 2D-CNN layers. Using trace-driven experiments, we demonstrate the superiority of EQARC over existing state-of-the-art approaches. Next, we collect useful information from each interpretable module to gain insight into EQARC. Following this step, we further propose AQARC (Advanced QARC), a lightweight version of QARC. Experimental results show that AQARC matches the performance of QARC while reducing computational overhead by 90%. In short, by learning from deep learning, we derive a rate control method that both achieves high performance and reduces computation cost.
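For intuition about the quality prediction module, below is a minimal PyTorch sketch of a hybrid attention-based CNN+GRU predictor that applies 2D-CNN layers per frame; all layer sizes, the attention form, and the names are illustrative assumptions rather than the EQARC architecture itself.

```python
# Minimal PyTorch sketch of a hybrid attention-based CNN+GRU quality predictor,
# in the spirit of replacing 1D-CNN layers with 2D-CNN layers.
import torch
import torch.nn as nn

class AttnCNNGRU(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        # Per-frame 2D-CNN feature extractor (frames are small thumbnails).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.gru = nn.GRU(32, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)       # temporal attention scores
        self.head = nn.Linear(hidden, 1)       # predicted quality score

    def forward(self, frames):                 # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
        hidden_seq, _ = self.gru(feats)        # (B, T, hidden)
        weights = torch.softmax(self.attn(hidden_seq), dim=1)
        pooled = (weights * hidden_seq).sum(dim=1)
        return self.head(pooled).squeeze(-1)

# Usage: batch of 2 clips, 5 frames each, 64x64 thumbnails.
model = AttnCNNGRU()
print(model(torch.randn(2, 5, 3, 64, 64)).shape)   # torch.Size([2])
```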
{"title":"Generalizing Rate Control Strategies for Realtime Video Streaming via Learning from Deep Learning","authors":"Tianchi Huang, Ruixiao Zhang, Chenglei Wu, Xin Yao, Chao Zhou, Bing Yu, Lifeng Sun","doi":"10.1145/3338533.3366606","DOIUrl":"https://doi.org/10.1145/3338533.3366606","url":null,"abstract":"The leading learning-based rate control method, i.e., QARC, achieves state-of-the-art performances but fails to interpret the fundamental principles, and thus lacks the abilities to further improve itself efficiently. In this paper, we propose EQARC (Explainable QARC) via reconstructing QARC's modules, aiming to demystify how QARC works. In details, we first utilize a novel hybrid attention-based CNN+GRU model to re-characterize the original quality prediction network and reasonably replace the QARC's 1D-CNN layers with 2D-CNN layers. Using trace-driven experiment, we demonstrate the superiority of EQARC over existing state-of-the-art approaches. Next, we collect several useful information from each interpretable modules and learn the insight of EQARC. Following this step, we further propose AQARC (Advanced QARC), which is the light-weighted version of QARC. Experimental results show that AQARC achieves the same performances as the QARC with an overhead reduction of 90%. In short, through learning from deep learning, we generalize a rate control method which can both reach high performance and reduce computation cost.","PeriodicalId":273086,"journal":{"name":"Proceedings of the ACM Multimedia Asia","volume":"87 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133029202","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Images captured in low-light conditions often have poor visual quality, as most details in dark regions are buried. Although some advanced low-light image enhancement methods can brighten an image and its dark regions, they still cannot reveal the details in those regions very well. This paper presents an adaptive dark region detail enhancement method for low-light images. As our method is based on the Retinex theory, we first formulate the Retinex-based low-light image enhancement problem as a Bayesian optimization framework. Then, a dark region prior is proposed, and an adaptive gradient amplification strategy is designed to incorporate this prior into the illumination estimation. The dark region prior, together with the widely used spatial smoothness and structure priors, leads to a dark-region- and structure-aware smoothness regularization term for illumination optimization. We provide a solver for this optimization and obtain the final enhanced results after post-processing. Experiments demonstrate that our method obtains good enhancement results with better dark region details compared to several state-of-the-art methods.
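As a rough illustration of Retinex-style enhancement that emphasizes dark regions, the following sketch smooths a max-RGB illumination estimate and amplifies the detail layer more where the scene is dark; the Gaussian smoothing, gain schedule, and gamma value are illustrative stand-ins for the paper's Bayesian optimization with a dark region prior.

```python
# Minimal sketch of Retinex-style low-light enhancement with extra detail
# amplification in dark regions (stand-in for the paper's dark region prior).
import numpy as np
from scipy.ndimage import gaussian_filter

def enhance_low_light(img, sigma=15.0, gamma=0.6, detail_gain=2.0, eps=1e-3):
    """img: HxWx3 float image in [0, 1]."""
    lum = img.max(axis=2)                        # max-RGB initial illumination
    illum = gaussian_filter(lum, sigma)          # smooth illumination estimate
    # Dark-region stand-in: amplify the detail layer more where it is dark.
    detail = lum - illum
    gain = 1.0 + detail_gain * (1.0 - np.clip(illum, 0.0, 1.0))
    illum_adj = np.clip(illum + gain * detail, eps, 1.0)
    reflectance = img / (illum_adj[..., None] + eps)
    enhanced = reflectance * (illum_adj[..., None] ** gamma)  # lift illumination
    return np.clip(enhanced, 0.0, 1.0)

# Usage on a synthetic dark image.
out = enhance_low_light(np.random.rand(64, 64, 3) * 0.2)
print(out.shape, float(out.mean()))
```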
{"title":"An Adaptive Dark Region Detail Enhancement Method for Low-light Images","authors":"Wengang Cheng, Caiyun Guo, Haitao Hu","doi":"10.1145/3338533.3366584","DOIUrl":"https://doi.org/10.1145/3338533.3366584","url":null,"abstract":"The images captured in low-light conditions are often of poor visual quality as most of details in dark regions buried. Although some advanced low-light image enhancement methods could lighten an image and its dark regions, they still cannot reveal the details in dark regions very well. This paper presents an adaptive dark region detail enhancement method for low-light images. As our method is based on the Retinex theory, we first formulate the Retinex-based low-light image enhancement problem into a Bayesian optimization framework. Then, a dark region prior is proposed and an adaptive gradient amplification strategy is designed to incorporate this prior into the illumination estimation. The dark region prior, together with the widely used spatial smooth and structure priors, leads to a dark region and structure-aware smoothness regularization term for illumination optimization. We provide a solver to this optimization and get final enhanced results after post processing. Experiments demonstrate that our method can obtain good enhancement results with better dark region details compared to several state-of-the-art methods.","PeriodicalId":273086,"journal":{"name":"Proceedings of the ACM Multimedia Asia","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133853269","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tian Gan, Zhixin Ma, Yu Lu, Xuemeng Song, Liqiang Nie
Presentation is one of the most important and vivid ways to deliver information to an audience. Apart from the content of a presentation, how the speaker behaves during the presentation makes a big difference. In other words, gestures, as part of the visual perception and synchronized with verbal information, express subtle information that voice or words alone cannot deliver. One of the most effective ways to improve a presentation is to practice with feedback and suggestions from an expert. However, hiring human experts is expensive and thus impractical most of the time. Towards this end, we propose a speech-to-gesture network (POSE) to generate exemplary body language given a speech recording as input. Specifically, we build an "expert" Speech-Gesture database based on featured TED talk videos, and design a two-layer attentive recurrent encoder-decoder network to learn the translation from speech to gesture, as well as the hierarchical structure within gestures. Lastly, given a speech audio sequence, an appropriate gesture is generated and visualized for more effective communication. Both objective and subjective validation show the effectiveness of our proposed method.
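To make the encoder-decoder idea concrete, here is a minimal PyTorch sketch of an attentive recurrent encoder-decoder that maps a speech-feature sequence to a gesture keypoint sequence; a single layer stands in for the two-layer hierarchy, and the audio and pose dimensions are assumptions, not the POSE network itself.

```python
# Minimal PyTorch sketch of an attentive recurrent encoder-decoder mapping a
# speech-feature sequence (e.g., MFCC frames) to a gesture keypoint sequence.
import torch
import torch.nn as nn

class Speech2Gesture(nn.Module):
    def __init__(self, audio_dim=13, hidden=128, pose_dim=2 * 14):
        super().__init__()
        self.encoder = nn.GRU(audio_dim, hidden, batch_first=True)
        self.decoder = nn.GRUCell(pose_dim + hidden, hidden)
        self.attn = nn.Linear(hidden * 2, 1)
        self.out = nn.Linear(hidden, pose_dim)

    def forward(self, audio, steps):             # audio: (B, T, audio_dim)
        enc, h = self.encoder(audio)             # enc: (B, T, hidden)
        h = h.squeeze(0)                         # decoder state (B, hidden)
        pose = audio.new_zeros(audio.size(0), self.out.out_features)
        poses = []
        for _ in range(steps):
            # Additive-style attention over encoder states.
            q = h.unsqueeze(1).expand(-1, enc.size(1), -1)
            scores = self.attn(torch.cat([enc, q], dim=-1))      # (B, T, 1)
            ctx = (torch.softmax(scores, dim=1) * enc).sum(dim=1)
            h = self.decoder(torch.cat([pose, ctx], dim=-1), h)
            pose = self.out(h)                   # next gesture keypoints
            poses.append(pose)
        return torch.stack(poses, dim=1)         # (B, steps, pose_dim)

# Usage: 2 clips, 50 audio frames of 13-dim MFCCs, decode 25 gesture frames.
model = Speech2Gesture()
print(model(torch.randn(2, 50, 13), steps=25).shape)
```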
{"title":"Learn to Gesture: Let Your Body Speak","authors":"Tian Gan, Zhixin Ma, Yu Lu, Xuemeng Song, Liqiang Nie","doi":"10.1145/3338533.3366602","DOIUrl":"https://doi.org/10.1145/3338533.3366602","url":null,"abstract":"Presentation is one of the most important and vivid methods to deliver information to audience. Apart from the content of presentation, how the speaker behaves during presentation makes a big difference. In other words, gestures, as part of the visual perception and synchronized with verbal information, express some subtle information that the voice or words alone cannot deliver. One of the most effective ways to improve presentation is to practice through feedback/suggestions by an expert. However, hiring human experts is expensive thus impractical most of the time. Towards this end, we propose a speech to gesture network (POSE) to generate exemplary body language given a vocal behavior speech as input. Specifically, we build an \"expert\" Speech-Gesture database based on the featured TED talk videos, and design a two-layer attentive recurrent encoder-decoder network to learn the translation from speech to gesture, as well as the hierarchical structure within gestures. Lastly, given a speech audio sequence, the appropriate gesture will be generated and visualized for a more effective communication. Both objective and subjective validation show the effectiveness of our proposed method.","PeriodicalId":273086,"journal":{"name":"Proceedings of the ACM Multimedia Asia","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131121919","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Wenqian Zhu, R. Hu, Zhongyuan Wang, Dengshi Li, Xiyue Gao
Vehicle re-identification (re-ID) has received increasing attention in recent years as a significant task that contributes greatly to intelligent video surveillance. The complex intra-class and inter-class variations of vehicle images pose huge challenges for vehicle re-ID, especially for re-identifying similar vehicles. In this paper, we focus on an interesting and challenging problem: vehicle re-ID of the same or similar model. Previous works mainly focus on extracting global features using deep models, ignoring the individual local regions in the vehicle front window, such as decorations and stickers attached to the windshield, which can be more discriminative for vehicle re-ID. Instead of directly embedding these regions to learn their features, we propose a Regional Structure-Aware model (RSA) to learn structure-aware cues from the position distribution of individual local regions in the vehicle front window (FW) area, constructing an FW structural map space. In this map space, deep models are able to learn more robust and discriminative spatial structure-aware features that improve the performance of vehicle re-ID for the same or similar model. We evaluate our method on Vehicle-1M, a large-scale vehicle re-ID dataset. The experimental results show that our method achieves promising performance and outperforms several recent state-of-the-art approaches.
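As a rough illustration of how a front-window structural map could be formed from detected local regions, the following sketch rasterizes normalized region boxes into an occupancy map that a CNN could consume; the box format, map resolution, and binary encoding are assumptions, not the RSA model's construction.

```python
# Minimal sketch of building a front-window (FW) structural map from detected
# local regions (e.g., stickers, decorations).  Boxes are assumed to be given
# in front-window coordinates as (x1, y1, x2, y2) fractions in [0, 1].
import numpy as np

def build_fw_structural_map(region_boxes, size=64):
    """Rasterize normalized region boxes into a size x size occupancy map."""
    fw_map = np.zeros((size, size), dtype=np.float32)
    for x1, y1, x2, y2 in region_boxes:
        c1, r1 = int(x1 * size), int(y1 * size)
        c2, r2 = int(np.ceil(x2 * size)), int(np.ceil(y2 * size))
        fw_map[r1:r2, c1:c2] = 1.0               # mark the region's footprint
    return fw_map

# Usage: two detected windshield regions for one vehicle.
boxes = [(0.10, 0.20, 0.25, 0.40),               # e.g., an inspection sticker
         (0.60, 0.55, 0.80, 0.75)]               # e.g., a hanging decoration
m = build_fw_structural_map(boxes)
print(m.shape, m.sum())
```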
{"title":"Deep Structural Feature Learning: Re-Identification of simailar vehicles In Structure-Aware Map Space","authors":"Wenqian Zhu, R. Hu, Zhongyuan Wang, Dengshi Li, Xiyue Gao","doi":"10.1145/3338533.3366585","DOIUrl":"https://doi.org/10.1145/3338533.3366585","url":null,"abstract":"Vehicle re-identification (re-ID) has received more attention in recent years as a significant work, making huge contribution to the intelligent video surveillance. The complex intra-class and inter-class variation of vehicle images bring huge challenges for vehicle re-ID, especially for the similar vehicle re-ID. In this paper we focus on an interesting and challenging problem, vehicle re-ID of the same/similar model. Previous works mainly focus on extracting global features using deep models, ignoring the individual loa-cal regions in vehicle front window, such as decorations and stickers attached to the windshield, that can be more discriminative for vehicle re-ID. Instead of directly embedding these regions to learn their features, we propose a Regional Structure-Aware model (RSA) to learn structure-aware cues with the position distribution of individual local regions in vehicle front window area, constructing a FW structural map space. In this map sapce, deep models are able to learn more robust and discriminative spatial structure-aware features to improve the performance for vehicle re-ID of the same/similar model. We evaluate our method on a large-scale vehicle re-ID dataset Vehicle-1M. The experimental results show that our method can achieve promising performance and outperforms several recent state-of-the-art approaches.","PeriodicalId":273086,"journal":{"name":"Proceedings of the ACM Multimedia Asia","volume":"102 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132329156","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Face inpainting is a sub-task of image inpainting designed to repair broken or occluded incomplete portraits. Due to the high complexity of facial details, inpainting on faces is more difficult. At present, face-related tasks often draw on strong methods from face recognition and face detection, using multi-task learning to boost their effect. Therefore, this paper proposes to add face prior knowledge to an existing advanced inpainting model, combined with a perceptual loss and an SSIM loss, to improve the model's repair efficiency. A new face inpainting pipeline and algorithm are implemented, and the repair results are improved.
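To illustrate how the two losses might be combined, here is a minimal PyTorch sketch of a perceptual loss on VGG-16 features plus a simplified whole-image SSIM loss (assuming torchvision 0.13 or newer for the string weights argument); the layer choice, loss weights, and unwindowed SSIM are assumptions, not the paper's exact setup.

```python
# Minimal PyTorch sketch of a combined perceptual + SSIM loss for inpainting.
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

vgg_feats = vgg16(weights="IMAGENET1K_V1").features[:16].eval()
for p in vgg_feats.parameters():
    p.requires_grad_(False)

def perceptual_loss(pred, target):
    """L1 distance between mid-level VGG feature maps."""
    return F.l1_loss(vgg_feats(pred), vgg_feats(target))

def ssim_loss(pred, target, c1=0.01 ** 2, c2=0.03 ** 2):
    """1 - SSIM computed over the whole image (simplified, no sliding window)."""
    mu_p, mu_t = pred.mean(dim=(2, 3)), target.mean(dim=(2, 3))
    var_p, var_t = pred.var(dim=(2, 3)), target.var(dim=(2, 3))
    cov = ((pred - mu_p[..., None, None]) *
           (target - mu_t[..., None, None])).mean(dim=(2, 3))
    ssim = ((2 * mu_p * mu_t + c1) * (2 * cov + c2)) / \
           ((mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2))
    return 1.0 - ssim.mean()

def inpainting_loss(pred, target, w_pix=1.0, w_perc=0.1, w_ssim=0.5):
    return (w_pix * F.l1_loss(pred, target)
            + w_perc * perceptual_loss(pred, target)
            + w_ssim * ssim_loss(pred, target))

# Usage on random 3x128x128 "faces".
x, y = torch.rand(2, 3, 128, 128), torch.rand(2, 3, 128, 128)
print(float(inpainting_loss(x, y)))
```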
{"title":"Semantic Prior Guided Face Inpainting","authors":"Zeyang Zhang, Xiaobo Zhou, Shengjie Zhao, Xiaoyan Zhang","doi":"10.1145/3338533.3366587","DOIUrl":"https://doi.org/10.1145/3338533.3366587","url":null,"abstract":"Face inpainting is a sub-task of image inpainting designed to repair broken or occluded incomplete portraits. Due to the high complexity of face image details, inpainting on the face is more difficult. At present, face-related tasks often draw on excellent methods from face recognition and face detection, using multitasking to boost its effect. Therefore, this paper proposes to add the face prior knowledge to the existing advanced inpainting model, combined with perceptual loss and SSIM loss to improve the model repair efficiency. A new face inpainting process and algorithm is implemented, and the repair effect is improved.","PeriodicalId":273086,"journal":{"name":"Proceedings of the ACM Multimedia Asia","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116252144","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
My main research interests include product search and visual question answering (VQA), both lying in the field of information retrieval (IR), which aims to obtain information resources relevant to an information need from a collection. Product search focuses on the e-commerce domain and aims to retrieve products that are not only relevant to the submitted queries but also fit users' personal preferences. Visual question answering aims to provide a natural-language answer to a free-form, open-ended, natural-language question about a given image, which requires semantic understanding of natural language and visual content, as well as knowledge extraction and logical reasoning.
{"title":"Multimedia Information Retrieval","authors":"Yangyang Guo","doi":"10.1145/3338533.3372212","DOIUrl":"https://doi.org/10.1145/3338533.3372212","url":null,"abstract":"My main research interests include product search and visual question answering (VQA), lying in the field of information retrieval (IR), which aims to obtain information system resources relevant to an information need from a collection. Product search focuses on the E-commerce domain and aims to retrieve products which are not only relevant to the submitted queries but also fit users' personal preferences; Visual question answering aims to provide a natural language answer for a given image and a free-form, open-ended, natural-language question about this image, which requires semantic understanding on natural language and visual content, as well as knowledge extraction and logic reasoning.","PeriodicalId":273086,"journal":{"name":"Proceedings of the ACM Multimedia Asia","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127517181","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recently, Multi-Target Multi-Camera Tracking (MTMC) has made a breakthrough thanks to the release of DukeMTMC, showing the feasibility of related applications. However, most existing MTMC methods are batch methods that attempt to find a globally optimal solution over the entire image sequence and are thus not suitable for real-time applications, e.g., customer tracking in unmanned stores. In this paper, we propose a low-cost online tracking algorithm, namely Deep Multi-Fisheye-Camera Tracking (DeepMFCT), to identify customers and locate their positions from multiple overlapping fisheye cameras. Built on top of any single-camera tracking algorithm (e.g., Deep SORT), our proposed algorithm establishes the correlation between tracks from different cameras. Owing to the lack of well-annotated datasets with multiple overlapping fisheye cameras, the main challenge is to efficiently overcome the domain gap between normal cameras and fisheye cameras using existing deep-learning-based models. To address this challenge, we integrate a single-camera tracking algorithm with cross-camera clustering that incorporates location information, which achieves strong performance on an unmanned store dataset and the Hall dataset. Experimental results show that the proposed algorithm improves the baselines by at least 7% in terms of MOTA on the Hall dataset.
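As a rough illustration of cross-camera association, the following sketch groups single-camera tracks by agglomerative clustering on their projected ground-plane locations; the floor-plane projection, distance threshold, and the omission of appearance features are assumptions, not the DeepMFCT clustering itself.

```python
# Minimal sketch of cross-camera track association for overlapping fisheye
# cameras: single-camera tracks are grouped by clustering their projected
# floor-plane positions; appearance features could be appended to each point.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def associate_tracks(track_locations, max_dist=0.8):
    """track_locations: list of (x, y) floor-plane positions, one per
    single-camera track at the current time step.  Returns a global ID per
    track; tracks closer than max_dist are assumed to be the same person."""
    pts = np.asarray(track_locations, dtype=float)
    if len(pts) == 1:
        return np.array([1])
    z = linkage(pts, method="average")
    return fcluster(z, t=max_dist, criterion="distance")

# Usage: four tracks from two cameras; tracks 0/2 and 1/3 overlap spatially.
locs = [(1.00, 2.00), (4.00, 1.00), (1.05, 2.10), (3.90, 1.10)]
print(associate_tracks(locs))          # e.g., [1 2 1 2]
```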
{"title":"Multiple Fisheye Camera Tracking via Real-Time Feature Clustering","authors":"Chon-Hou Sio, Hong-Han Shuai, Wen-Huang Cheng","doi":"10.1145/3338533.3366581","DOIUrl":"https://doi.org/10.1145/3338533.3366581","url":null,"abstract":"Recently, Multi-Target Multi-Camera Tracking (MTMC) makes a breakthrough due to the release of DukeMTMC and show the feasibility of related applications. However, most of the existing MTMC methods focus on the batch methods which attempt to find the global optimal solution from the entire image sequence and thus are not suitable for the real-time applications, e.g., customer tracking in unmanned stores. In this paper, we propose a low-cost online tracking algorithm, namely, Deep Multi-Fisheye-Camera Tracking (DeepMFCT) to identify the customers and locate the corresponding positions from multiple overlapping fisheye cameras. Based on any single camera tracking algorithm (e.g., Deep SORT), our proposed algorithm establishes the correlation between different single camera tracks. Owing to the lack of well-annotated multiple overlapping fisheye cameras dataset, the main challenge of this issue is to efficiently overcome the domain gap problem between normal cameras and fisheye cameras based on existed deep learning based model. To address this challenge, we integrate a single camera tracking algorithm with cross camera clustering including location information that achieves great performance on the unmanned store dataset and Hall dataset. Experimental results show that the proposed algorithm improves the baselines by at least 7% in terms of MOTA on the Hall dataset.","PeriodicalId":273086,"journal":{"name":"Proceedings of the ACM Multimedia Asia","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126118802","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}