The main challenge of fine-grained image hashing is learning highly discriminative hash codes that distinguish within-class and between-class variations. On the one hand, most existing methods treat all sample pairs as equivalent in hash learning, ignoring the more discriminative information contained in hard sample pairs. On the other hand, in the testing phase, these methods ignore the influence of outliers on retrieval performance. To solve the above issues, this paper proposes a novel Self-Adaptive Hashing method, which learns discriminative hash codes by mining hard sample pairs and improves retrieval performance by correcting outliers in the testing phase. In particular, to improve the discriminability of hash codes, a pair-weighted loss function is proposed to enhance the learning of hash functions on hard sample pairs. Furthermore, in the testing phase, a self-adaptive module is proposed to discover and correct outliers by generating self-adaptive boundaries, thereby improving retrieval performance. Experimental results on two widely used fine-grained datasets demonstrate the effectiveness of the proposed method.
{"title":"Self-Adaptive Hashing for Fine-Grained Image Retrieval","authors":"Yajie Zhang, Yuxuan Dai, Wei Tang, Lu Jin, Xinguang Xiang","doi":"10.1145/3469877.3490591","DOIUrl":"https://doi.org/10.1145/3469877.3490591","url":null,"abstract":"The main challenge of fine-grained image hashing is how to learn highly discriminative hash codes to distinguish the within and between class variations. On the one hand, most of the existing methods treat sample pairs as equivalent in hash learning, ignoring the more discriminative information contained in hard sample pairs. On the other hand, in the testing phase, these methods ignore the influence of outliers on retrieval performance. In order to solve the above issues, this paper proposes a novel Self-Adaptive Hashing method, which learns discriminative hash codes by mining hard sample pairs, and improves retrieval performance by correcting outliers in the testing phase. In particular, to improve the discriminability of hash codes, a pair-weighted based loss function is proposed to enhance the learning of hash functions of hard sample pairs. Furthermore, in the testing phase, a self-adaptive module is proposed to discover and correct outliers by generating self-adaptive boundaries, thereby improving the retrieval performance. Experimental results on two widely-used fine-grained datasets demonstrate the effectiveness of the proposed method.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124046981","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
One in six people has some form of visual impairment, ranging from mild vision loss to total blindness. The visually impaired constantly face the danger of walking into people or hazardous objects. This thesis proposes the use of vibrotactile feedback as an obstacle detection system for visually impaired users. We utilize a mixed reality headset with on-board depth sensors to build a digital map of the real world, and a suit with an array of actuators to indicate to the visually impaired the position of obstacles around them. This is demonstrated by a simple prototype built using commercially available devices (Microsoft HoloLens and bHaptics Tactot), and a qualitative user study was conducted to evaluate the viability of the proposed system. Through user testing performed on subjects with simulated visual impairments, our results affirm the potential of using mixed reality to detect obstacles in the environment while transmitting only essential information through the haptic suit, given its limited bandwidth.
{"title":"Goldeye: Enhanced Spatial Awareness for the Visually Impaired using Mixed Reality and Vibrotactile Feedback","authors":"Jun Lee, Narayanan Rajeev, A. Bhojan","doi":"10.1145/3469877.3495636","DOIUrl":"https://doi.org/10.1145/3469877.3495636","url":null,"abstract":"One in six people have some form of visual impairment ranging from mild vision loss to total blindness. The visually impaired constantly face the danger of walking into people or hazardous objects. This thesis proposes the use of vibrotactile feedback to serve as an obstacle detection system for visually impaired users. We utilize a mixed reality headset with on-board depth sensors to build a digital map of the real world and a suit with an array of actuators to provide feedback as to indicate to the visually impaired the position of obstacles around them. This is demonstrated by a simple prototype built using commercially available devices (Microsoft HoloLens and bHaptics Tactot) and a qualitative user study was conducted to evaluate the viability of the proposed system. Through our user-testing performed on subjects with simulated visual impairments, our results affirm the potential of using mixed reality to detect obstacles in the environment along with only transmitting essential information through the haptic suit due to limited bandwidth.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122154472","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Visual storytelling, which aims at automatically producing a narrative paragraph for a photo album, remains quite challenging due to the complexity and diversity of photo album content. In addition, open-domain photo albums cover a broad range of topics, which results in highly variable vocabularies and expression styles for describing them. In this work, a novel teacher-student visual storytelling framework with hierarchical BERT semantic guidance (HBSG) is proposed to address the above-mentioned challenges. The proposed teacher module consists of two joint tasks, namely word-level latent topic generation and semantic-guided sentence generation. The first task aims to predict the latent topic of the story. As there is no ground-truth topic information, a pre-trained BERT model based on visual contents and annotated stories is utilized to mine topics. The topic vector is then distilled into a designed image-topic prediction model. In the semantic-guided sentence generation task, HBSG is introduced for two purposes. The first is to narrow down the language complexity across topics, where a co-attention decoder over visual and semantic features is designed to leverage the latent topics to induce topic-related language models. The second is to employ sentence semantics as an online external linguistic-knowledge teacher. Finally, an auxiliary loss is devised to transfer linguistic knowledge into the language generation model. Extensive experiments demonstrate the effectiveness of the HBSG framework, which surpasses the state-of-the-art approaches evaluated on the VIST test set.
{"title":"Visual Storytelling with Hierarchical BERT Semantic Guidance","authors":"Ruichao Fan, Hanli Wang, Jinjing Gu, Xianhui Liu","doi":"10.1145/3469877.3490604","DOIUrl":"https://doi.org/10.1145/3469877.3490604","url":null,"abstract":"Visual storytelling, which aims at automatically producing a narrative paragraph for photo album, remains quite challenging due to the complexity and diversity of photo album content. In addition, open-domain photo albums cover a broad range of topics and this results in highly variable vocabularies and expression styles to describe photo albums. In this work, a novel teacher-student visual storytelling framework with hierarchical BERT semantic guidance (HBSG) is proposed to address the above-mentioned challenges. The proposed teacher module consists of two joint tasks, namely, word-level latent topic generation and semantic-guided sentence generation. The first task aims to predict the latent topic of the story. As there is no ground-truth topic information, a pre-trained BERT model based on visual contents and annotated stories is utilized to mine topics. Then the topic vector is distilled to a designed image-topic prediction model. In the semantic-guided sentence generation task, HBSG is introduced for two purposes. The first is to narrow down the language complexity across topics, where the co-attention decoder with vision and semantic is designed to leverage the latent topics to induce topic-related language models. The second is to employ sentence semantic as an online external linguistic knowledge teacher module. Finally, an auxiliary loss is devised to transform linguistic knowledge into the language generation model. Extensive experiments are performed to demonstrate the effectiveness of HBSG framework, which surpasses the state-of-the-art approaches evaluated on the VIST test set.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"107 Pt 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129111001","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Due to the heterogeneity gap, the data representations of different media are inconsistent and lie in different feature spaces. It is therefore challenging to measure the fine-grained gap between them. To this end, we propose an attention space training method to learn common representations of different media data. Specifically, we utilize local self-attention layers to learn a common attention space between different media data. We propose a similarity concatenation method to capture the content relationships between features. To further improve the robustness of the model, we also train a local position encoding to capture the spatial relationships between features. In this way, our proposed method can effectively reduce the gap between different feature distributions on cross-media retrieval tasks. It also improves fine-grained recognition performance by attending to high-level semantic information. Extensive experiments and ablation studies demonstrate that our proposed method achieves state-of-the-art performance. At the same time, our approach provides a new pipeline for fine-grained cross-media retrieval. The source code and models are publicly available at: https://github.com/NUST-Machine-Intelligence-Laboratory/SAFGCMHN.
{"title":"Local Self-Attention on Fine-grained Cross-media Retrieval","authors":"Chen Wang, Yazhou Yao, Qiong Wang, Zhenmin Tang","doi":"10.1145/3469877.3490590","DOIUrl":"https://doi.org/10.1145/3469877.3490590","url":null,"abstract":"Due to the heterogeneity gap, the data representation of different media is inconsistent and belongs to different feature spaces. Therefore, it is challenging to measure the fine-grained gap between them. To this end, we propose an attention space training method to learn common representations of different media data. Specifically, we utilize local self-attention layers to learn the common attention space between different media data. We propose a similarity concatenation method to understand the content relationship between features. To further improve the robustness of the model, we also train a local position encoding to capture the spatial relationships between features. In this way, our proposed method can effectively reduce the gap between different feature distributions on cross-media retrieval tasks. It also improves the fine-grained recognition performance by attaching attention to high-level semantic information. Extensive experiments and ablation studies demonstrate that our proposed method achieves state-of-the-art performance. At the same time, our approach provides a new pipeline for fine-grained cross-media retrieval. The source code and models are publicly available at: https://github.com/NUST-Machine-Intelligence-Laboratory/SAFGCMHN.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123356357","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents a novel training method for end-to-end scene text recognition. End-to-end scene text recognition offers high recognition accuracy, especially when using an encoder-decoder model based on the Transformer. To train a highly accurate end-to-end model, we need to prepare a large image-to-text paired dataset for the target language. However, it is difficult to collect such data, especially for resource-poor languages. To overcome this difficulty, our proposed method utilizes well-prepared large datasets in resource-rich languages such as English to train an encoder-decoder model for a resource-poor language. Our key idea is to build a model in which the encoder reflects knowledge of multiple languages while the decoder specializes in knowledge of just the resource-poor language. To this end, the proposed method pre-trains the encoder on a multilingual dataset that combines the resource-poor language’s dataset and the resource-rich language’s dataset, so as to learn language-invariant knowledge for scene text recognition. The proposed method also pre-trains the decoder on the resource-poor language’s dataset to make the decoder better suited to that language. Experiments on Japanese scene text recognition using a small, publicly available dataset demonstrate the effectiveness of the proposed method.
{"title":"Utilizing Resource-Rich Language Datasets for End-to-End Scene Text Recognition in Resource-Poor Languages","authors":"Shota Orihashi, Yoshihiro Yamazaki, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Ryo Masumura","doi":"10.1145/3469877.3490571","DOIUrl":"https://doi.org/10.1145/3469877.3490571","url":null,"abstract":"This paper presents a novel training method for end-to-end scene text recognition. End-to-end scene text recognition offers high recognition accuracy, especially when using the encoder-decoder model based on Transformer. To train a highly accurate end-to-end model, we need to prepare a large image-to-text paired dataset for the target language. However, it is difficult to collect this data, especially for resource-poor languages. To overcome this difficulty, our proposed method utilizes well-prepared large datasets in resource-rich languages such as English, to train the resource-poor encoder-decoder model. Our key idea is to build a model in which the encoder reflects knowledge of multiple languages while the decoder specializes in knowledge of just the resource-poor language. To this end, the proposed method pre-trains the encoder by using a multilingual dataset that combines the resource-poor language’s dataset and the resource-rich language’s dataset to learn language-invariant knowledge for scene text recognition. The proposed method also pre-trains the decoder by using the resource-poor language’s dataset to make the decoder better suited to the resource-poor language. Experiments on Japanese scene text recognition using a small, publicly available dataset demonstrate the effectiveness of the proposed method.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"329 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121992174","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unmanned Aerial Vehicles (UAVs) have moved beyond a platform for hobbyists to enable environmental monitoring, journalism, filmmaking, search and rescue, package delivery, and entertainment. This paper describes 3D displays using swarms of flying light specks, FLSs. An FLS is a small (hundreds of micrometers in size) UAV with one or more light sources to generate different colors and textures with adjustable brightness. A synchronized swarm of FLSs renders an illumination in a pre-specified 3D volume, an FLS display. An FLS display provides true depth, enabling a user to perceive a scene more completely by analyzing its illumination from different angles. An FLS display may be either non-immersive or immersive. Both will support 3D acoustics. Non-immersive FLS displays may be the size of a 1980s computer monitor, enabling a surgical team to observe and control micro robots performing heart surgery inside a patient’s body. Immersive FLS displays may be the size of a room, enabling users to interact with objects, e.g., a rock or a teapot. An object with behavior will be constructed using FLS-matter, which will enable a user to touch and manipulate the object, e.g., a user may pick up a teapot or throw a rock. An immersive and interactive FLS display will approximate Star Trek’s holodeck. A successful realization of the research ideas presented in this paper will provide fundamental insights into implementing a holodeck using swarms of FLSs. A holodeck will transform the future of human communication and perception, and how we interact with information and data. It will revolutionize how we work, learn, play, entertain, receive medical care, and socialize.
{"title":"Holodeck: Immersive 3D Displays Using Swarms of Flying Light Specks [Extended Abstract]","authors":"Shahram Ghandeharizadeh","doi":"10.1145/3469877.3493698","DOIUrl":"https://doi.org/10.1145/3469877.3493698","url":null,"abstract":"Unmanned Aerial Vehicles (UAVs) have moved beyond a platform for hobbyists to enable environmental monitoring, journalism, film industry, search and rescue, package delivery, and entertainment. This paper describes 3D displays using swarms of flying light specks, FLSs. An FLS is a small (hundreds of micrometers in size) UAV with one or more light sources to generate different colors and textures with adjustable brightness. A synchronized swarm of FLSs renders an illumination in a pre-specified 3D volume, an FLS display. An FLS display provides true depth, enabling a user to perceive a scene more completely by analyzing its illumination from different angles. An FLS display may either be non-immersive or immersive. Both will support 3D acoustics. Non-immersive FLS displays may be the size of a 1980’s computer monitor, enabling a surgical team to observe and control micro robots performing heart surgery inside a patient’s body. Immersive FLS displays may be the size of a room, enabling users to interact with objects, e.g., a rock, a teapot. An object with behavior will be constructed using FLS-matters. FLS-matter will enable a user to touch and manipulate an object, e.g., a user may pick up a teapot or throw a rock. An immersive and interactive FLS display will approximate Star Trek’s holodeck. A successful realization of the research ideas presented in this paper will provide fundamental insights into implementing a holodeck using swarms of FLSs. A holodeck will transform the future of human communication and perception, and how we interact with information and data. It will revolutionize the future of how we work, learn, play and entertain, receive medical care, and socialize.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"87 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124401738","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Temporal Moment Localization (TML) in untrimmed videos is a challenging task in the field of multimedia, which aims at localizing the start and end points of the activity in the video described by a sentence query. Existing methods mainly focus on mining the correlation between video and sentence representations or investigating how to fuse the two modalities. These works understand the video and sentence coarsely, ignoring the fact that a sentence can be understood from various semantics, and that the dominant words affecting moment localization in these semantics are the action and object references. Toward this end, we propose a Hierarchical Deep Residual Reasoning (HDRR) model, which decomposes the video and sentence into multi-level representations with different semantics to achieve finer-grained localization. Furthermore, considering that videos with different resolutions and sentences of different lengths differ in how difficult they are to understand, we design simple yet effective Res-BiGRUs for feature fusion, which are able to grasp useful information in a self-adapting manner. Extensive experiments conducted on the Charades-STA and ActivityNet-Captions datasets demonstrate the superiority of our HDRR model over other state-of-the-art methods.
{"title":"Hierarchical Deep Residual Reasoning for Temporal Moment Localization","authors":"Ziyang Ma, Xianjing Han, Xuemeng Song, Yiran Cui, Liqiang Nie","doi":"10.1145/3469877.3490595","DOIUrl":"https://doi.org/10.1145/3469877.3490595","url":null,"abstract":"Temporal Moment Localization (TML) in untrimmed videos is a challenging task in the field of multimedia, which aims at localizing the start and end points of the activity in the video, described by a sentence query. Existing methods mainly focus on mining the correlation between video and sentence representations or investigating the fusion manner of the two modalities. These works mainly understand the video and sentence coarsely, ignoring the fact that a sentence can be understood from various semantics, and the dominant words affecting the moment localization in the semantics are the action and object reference. Toward this end, we propose a Hierarchical Deep Residual Reasoning (HDRR) model, which decomposes the video and sentence into multi-level representations with different semantics to achieve a finer-grained localization. Furthermore, considering that videos with different resolution and sentences with different length have different difficulty in understanding, we design the simple yet effective Res-BiGRUs for feature fusion, which is able to grasp the useful information in a self-adapting manner. Extensive experiments conducted on Charades-STA and ActivityNet-Captions datasets demonstrate the superiority of our HDRR model compared with other state-of-the-art methods.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123846180","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper focuses on camouflaged object detection (COD), the task of detecting objects hidden in the background. Most current COD models aim to highlight the target object directly while outputting ambiguous camouflaged boundaries. On the other hand, the performance of models that consider edge information is not yet satisfactory. To this end, we propose a new framework that makes full use of multiple visual cues, i.e., saliency as well as edges, to refine the predicted camouflaged map. This framework consists of three key components, i.e., a pseudo-edge generator, a pseudo-map generator, and an uncertainty-aware refinement module. In particular, the pseudo-edge generator estimates the object boundary and outputs the pseudo-edge label, while a conventional COD method serves as the pseudo-map generator and outputs the pseudo-map label. We then propose an uncertainty-based module to reduce the uncertainty and noise of these two pseudo labels; it takes both pseudo labels as input and outputs an edge-accurate camouflaged map. Experiments on various COD datasets demonstrate the effectiveness of our method, with superior performance to existing state-of-the-art methods.
{"title":"Improving Camouflaged Object Detection with the Uncertainty of Pseudo-edge Labels","authors":"Nobukatsu Kajiura, Hong Liu, S. Satoh","doi":"10.1145/3469877.3490587","DOIUrl":"https://doi.org/10.1145/3469877.3490587","url":null,"abstract":"This paper focuses on camouflaged object detection (COD), which is a task to detect objects hidden in the background. Most of the current COD models aim to highlight the target object directly while outputting ambiguous camouflaged boundaries. On the other hand, the performance of the models considering edge information is not yet satisfactory. To this end, we propose a new framework that makes full use of multiple visual cues, i.e., saliency as well as edges, to refine the predicted camouflaged map. This framework consists of three key components, i.e., a pseudo-edge generator, a pseudo-map generator, and an uncertainty-aware refinement module. In particular, the pseudo-edge generator estimates the boundary that outputs the pseudo-edge label, and the conventional COD method serves as the pseudo-map generator that outputs the pseudo-map label. Then, we propose an uncertainty-based module to reduce the uncertainty and noise of such two pseudo labels, which takes both pseudo labels as input and outputs an edge-accurate camouflaged map. Experiments on various COD datasets demonstrate the effectiveness of our method with superior performance to the existing state-of-the-art methods.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125957253","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The ever-increasing number of 3D applications makes point cloud compression unprecedentedly important and needed. In this paper, we propose a patch-based compression process using deep learning, focusing on lossy point cloud geometry compression. Unlike existing point cloud compression networks, which apply feature extraction and reconstruction to the entire point cloud, we divide the point cloud into patches and compress each patch independently. In the decoding process, we finally assemble the decompressed patches into a complete point cloud. In addition, we train our network with a patch-to-patch criterion, i.e., we use the local reconstruction loss for optimization to approximate the global reconstruction optimality. Our method outperforms the state-of-the-art in terms of rate-distortion performance, especially at low bitrates. Moreover, the proposed compression process guarantees that the output has the same number of points as the input. The network model of this method can easily be applied to other point cloud reconstruction problems, such as upsampling.
{"title":"Patch-Based Deep Autoencoder for Point Cloud Geometry Compression","authors":"Kang-Soo You, Pan Gao","doi":"10.1145/3469877.3490611","DOIUrl":"https://doi.org/10.1145/3469877.3490611","url":null,"abstract":"The ever-increasing 3D application makes the point cloud compression unprecedentedly important and needed. In this paper, we propose a patch-based compression process using deep learning, focusing on the lossy point cloud geometry compression. Unlike existing point cloud compression networks, which apply feature extraction and reconstruction on the entire point cloud, we divide the point cloud into patches and compress each patch independently. In the decoding process, we finally assemble the decompressed patches into a complete point cloud. In addition, we train our network by a patch-to-patch criterion, i.e., use the local reconstruction loss for optimization, to approximate the global reconstruction optimality. Our method outperforms the state-of-the-art in terms of rate-distortion performance, especially at low bitrates. Moreover, the compression process we proposed can guarantee to generate the same number of points as the input. The network model of this method can be easily applied to other point cloud reconstruction problems, such as upsampling.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123984882","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we focus on solving the navigation problem of embodied question answering (EmbodiedQA), where a lack of experience and common-sense information essentially results in failure to find the target when the robot is spawned in unknown environments. We present a route planning method named the Path Estimation and Memory Recalling (PEMR) framework. PEMR includes a “looking ahead” process, i.e., a visual feature extractor module that estimates feasible paths for gathering 3D navigational information, and a “looking behind” process, a memory recalling mechanism that aims at fully leveraging the past experience collected by the feature extractor. To encourage the navigator to learn more accurate prior expert experience, we improve the original benchmark dataset and provide a family of evaluation metrics for diagnosing both the navigation and question answering modules. We show strong experimental results of PEMR on the EmbodiedQA navigation task.
{"title":"Explore before Moving: A Feasible Path Estimation and Memory Recalling Framework for Embodied Navigation","authors":"Yang Wu, Shirui Feng, Guanbin Li, Liang Lin","doi":"10.1145/3469877.3490570","DOIUrl":"https://doi.org/10.1145/3469877.3490570","url":null,"abstract":"In this paper, we focus on solving the navigation problem of embodied question answering (EmbodiedQA), where the lack of experience and common sense information essentially result in a failure finding target when the robot is spawn in unknown environments. We present a route planning method named Path Estimation and Memory Recalling (PEMR) framework. PEMR includes a “looking ahead” process, i.e. a visual feature extractor module that estimates feasible paths for gathering 3D navigational information; another process “looking behind” process that is a memory recalling mechanism aims at fully leveraging past experience collected by the feature extractor. To encourage the navigator to learn more accurate prior expert experience, we improve the original benchmark dataset and provide a family of evaluation metrics for diagnosing both navigation and question answering modules. We show strong experimental results of PEMR on the EmbodiedQA navigation task.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124663660","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}