
Latest Publications in ACM Multimedia Asia

Self-Adaptive Hashing for Fine-Grained Image Retrieval
Pub Date : 2021-12-01 DOI: 10.1145/3469877.3490591
Yajie Zhang, Yuxuan Dai, Wei Tang, Lu Jin, Xinguang Xiang
The main challenge of fine-grained image hashing is how to learn highly discriminative hash codes that distinguish within-class and between-class variations. On the one hand, most existing methods treat all sample pairs as equivalent in hash learning, ignoring the more discriminative information contained in hard sample pairs. On the other hand, in the testing phase, these methods ignore the influence of outliers on retrieval performance. To address these issues, this paper proposes a novel Self-Adaptive Hashing method, which learns discriminative hash codes by mining hard sample pairs and improves retrieval performance by correcting outliers in the testing phase. In particular, to improve the discriminability of hash codes, a pair-weighted loss function is proposed to strengthen the learning of hash functions on hard sample pairs. Furthermore, in the testing phase, a self-adaptive module is proposed to discover and correct outliers by generating self-adaptive boundaries, thereby improving the retrieval performance. Experimental results on two widely-used fine-grained datasets demonstrate the effectiveness of the proposed method.
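As a rough illustration of the pair-weighting idea described in the abstract, the following PyTorch sketch weights a contrastive-style pairwise loss so that hard pairs contribute more; the function name, margin, and softmax weighting scheme are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def pair_weighted_loss(codes, labels, margin=2.0):
    """codes: (N, K) real-valued hash outputs in [-1, 1]; labels: (N,) class ids."""
    dist = torch.cdist(codes, codes, p=2)                       # pairwise distances
    same = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float()  # same-class mask
    # Contrastive-style per-pair loss: pull same-class codes together,
    # push different-class codes apart by at least `margin`.
    per_pair = same * dist.pow(2) + (1 - same) * F.relu(margin - dist).pow(2)
    # Hard pairs (large residual loss) get larger weights, so hash learning
    # focuses on the pairs that are currently hardest to separate.
    weights = torch.softmax(per_pair.detach().flatten(), dim=0).view_as(per_pair)
    return (weights * per_pair).sum()

# Usage sketch: codes = torch.tanh(backbone(images)); loss = pair_weighted_loss(codes, labels)
```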
Citations: 0
Goldeye: Enhanced Spatial Awareness for the Visually Impaired using Mixed Reality and Vibrotactile Feedback
Pub Date : 2021-12-01 DOI: 10.1145/3469877.3495636
Jun Lee, Narayanan Rajeev, A. Bhojan
One in six people have some form of visual impairment, ranging from mild vision loss to total blindness. The visually impaired constantly face the danger of walking into people or hazardous objects. This thesis proposes the use of vibrotactile feedback as an obstacle detection system for visually impaired users. We utilize a mixed reality headset with on-board depth sensors to build a digital map of the real world, and a suit with an array of actuators to provide feedback indicating to the visually impaired the positions of obstacles around them. This is demonstrated by a simple prototype built using commercially available devices (Microsoft HoloLens and bHaptics Tactot), and a qualitative user study was conducted to evaluate the viability of the proposed system. Through user testing performed on subjects with simulated visual impairments, our results affirm the potential of using mixed reality to detect obstacles in the environment while transmitting only essential information through the haptic suit, given its limited bandwidth.
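To make the feedback loop concrete, here is a minimal sketch of how an obstacle's bearing and distance from the depth sensors could be mapped to a vibration motor and intensity on the suit; the actuator layout, ranges, and 0-100 intensity scale are hypothetical assumptions, not the actual Goldeye/bHaptics interface.

```python
def obstacle_to_vibration(angle_deg, distance_m, n_motors=8, max_range_m=3.0):
    """angle_deg: obstacle bearing relative to the user in [-180, 180);
    distance_m: obstacle distance reported by the depth sensor."""
    # Pick the motor whose angular sector contains the obstacle bearing.
    sector = 360.0 / n_motors
    motor = int(((angle_deg + 180.0) % 360.0) // sector)
    # Closer obstacles vibrate more strongly; beyond max_range_m, stay silent.
    closeness = max(0.0, 1.0 - distance_m / max_range_m)
    intensity = round(100 * closeness)
    return motor, intensity

print(obstacle_to_vibration(30.0, 1.0))   # an obstacle 1 m away, slightly to the right
```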
Citations: 4
Visual Storytelling with Hierarchical BERT Semantic Guidance
Pub Date : 2021-12-01 DOI: 10.1145/3469877.3490604
Ruichao Fan, Hanli Wang, Jinjing Gu, Xianhui Liu
Visual storytelling, which aims at automatically producing a narrative paragraph for a photo album, remains quite challenging due to the complexity and diversity of photo album content. In addition, open-domain photo albums cover a broad range of topics, which results in highly variable vocabularies and expression styles for describing them. In this work, a novel teacher-student visual storytelling framework with hierarchical BERT semantic guidance (HBSG) is proposed to address the above-mentioned challenges. The proposed teacher module consists of two joint tasks, namely, word-level latent topic generation and semantic-guided sentence generation. The first task aims to predict the latent topic of the story. As there is no ground-truth topic information, a pre-trained BERT model based on visual contents and annotated stories is utilized to mine topics. The topic vector is then distilled into a designed image-topic prediction model. In the semantic-guided sentence generation task, HBSG is introduced for two purposes. The first is to narrow down the language complexity across topics, where a co-attention decoder over vision and semantics is designed to leverage the latent topics to induce topic-related language models. The second is to employ sentence semantics as an online external linguistic-knowledge teacher module. Finally, an auxiliary loss is devised to transfer linguistic knowledge into the language generation model. Extensive experiments demonstrate the effectiveness of the HBSG framework, which surpasses the state-of-the-art approaches evaluated on the VIST test set.
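The topic-distillation part of the teacher module can be pictured with the small PyTorch sketch below, where a frozen BERT embedding of the annotated story stands in for the mined topic and an image-topic predictor regresses it from album features; the module shapes and the MSE objective are illustrative assumptions rather than HBSG's exact design.

```python
import torch
import torch.nn as nn

class ImageTopicPredictor(nn.Module):
    def __init__(self, img_dim=2048, topic_dim=768):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(img_dim, 1024), nn.ReLU(),
                                 nn.Linear(1024, topic_dim))

    def forward(self, album_feats):               # (B, T, img_dim) photo features
        return self.net(album_feats.mean(dim=1))  # pool the album, predict one topic vector

def distill_step(predictor, album_feats, story_topic):
    """story_topic: (B, topic_dim) BERT embedding of the ground-truth story (teacher signal)."""
    pred = predictor(album_feats)
    return nn.functional.mse_loss(pred, story_topic.detach())
```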
Citations: 3
Local Self-Attention on Fine-grained Cross-media Retrieval
Pub Date : 2021-12-01 DOI: 10.1145/3469877.3490590
Chen Wang, Yazhou Yao, Qiong Wang, Zhenmin Tang
Due to the heterogeneity gap, the data representations of different media are inconsistent and belong to different feature spaces. Therefore, it is challenging to measure the fine-grained gap between them. To this end, we propose an attention-space training method to learn common representations of different media data. Specifically, we utilize local self-attention layers to learn a common attention space between different media data. We propose a similarity concatenation method to understand the content relationship between features. To further improve the robustness of the model, we also train a local position encoding to capture the spatial relationships between features. In this way, our proposed method can effectively reduce the gap between different feature distributions on cross-media retrieval tasks. It also improves fine-grained recognition performance by attending to high-level semantic information. Extensive experiments and ablation studies demonstrate that our proposed method achieves state-of-the-art performance. At the same time, our approach provides a new pipeline for fine-grained cross-media retrieval. The source code and models are publicly available at: https://github.com/NUST-Machine-Intelligence-Laboratory/SAFGCMHN.
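For readers unfamiliar with the building block, the sketch below shows a generic self-attention layer with a residual connection in PyTorch; the paper's specific local windowing and similarity-concatenation details are not reproduced here, so treat it only as the shared layer that image and text features (projected to a common dimension) would pass through.

```python
import torch
import torch.nn as nn

class SelfAttentionBlock(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                 # x: (B, N, dim) region or token features
        out, _ = self.attn(x, x, x)       # queries, keys, values all come from x
        return self.norm(x + out)         # residual connection + normalization

block = SelfAttentionBlock(256)
print(block(torch.randn(2, 49, 256)).shape)   # torch.Size([2, 49, 256])
```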
Citations: 2
Utilizing Resource-Rich Language Datasets for End-to-End Scene Text Recognition in Resource-Poor Languages
Pub Date : 2021-11-24 DOI: 10.1145/3469877.3490571
Shota Orihashi, Yoshihiro Yamazaki, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Ryo Masumura
This paper presents a novel training method for end-to-end scene text recognition. End-to-end scene text recognition offers high recognition accuracy, especially when using the encoder-decoder model based on Transformer. To train a highly accurate end-to-end model, we need to prepare a large image-to-text paired dataset for the target language. However, it is difficult to collect this data, especially for resource-poor languages. To overcome this difficulty, our proposed method utilizes well-prepared large datasets in resource-rich languages such as English, to train the resource-poor encoder-decoder model. Our key idea is to build a model in which the encoder reflects knowledge of multiple languages while the decoder specializes in knowledge of just the resource-poor language. To this end, the proposed method pre-trains the encoder by using a multilingual dataset that combines the resource-poor language’s dataset and the resource-rich language’s dataset to learn language-invariant knowledge for scene text recognition. The proposed method also pre-trains the decoder by using the resource-poor language’s dataset to make the decoder better suited to the resource-poor language. Experiments on Japanese scene text recognition using a small, publicly available dataset demonstrate the effectiveness of the proposed method.
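A possible shape of the two-stage pre-training schedule is sketched below, with generic encoder/decoder modules and data loaders standing in for the actual Transformer recognizer and datasets; the loop structure and freezing strategy are assumptions meant only to make the description concrete.

```python
import torch

def two_stage_pretrain(encoder, decoder, multilingual_loader, poor_lang_loader,
                       criterion, epochs=10, lr=1e-4):
    # Stage 1: train the encoder (with a decoder head) on the combined
    # resource-rich + resource-poor data so it learns language-invariant features.
    opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=lr)
    for _ in range(epochs):
        for images, texts in multilingual_loader:
            loss = criterion(decoder(encoder(images)), texts)
            opt.zero_grad(); loss.backward(); opt.step()

    # Stage 2: freeze the encoder and pre-train the decoder on the
    # resource-poor language only, so it specializes in that vocabulary.
    for p in encoder.parameters():
        p.requires_grad = False
    opt = torch.optim.Adam(decoder.parameters(), lr=lr)
    for _ in range(epochs):
        for images, texts in poor_lang_loader:
            loss = criterion(decoder(encoder(images)), texts)
            opt.zero_grad(); loss.backward(); opt.step()
```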
Citations: 1
Holodeck: Immersive 3D Displays Using Swarms of Flying Light Specks [Extended Abstract]
Pub Date : 2021-11-02 DOI: 10.1145/3469877.3493698
Shahram Ghandeharizadeh
Unmanned Aerial Vehicles (UAVs) have moved beyond a platform for hobbyists to enable environmental monitoring, journalism, film industry, search and rescue, package delivery, and entertainment. This paper describes 3D displays using swarms of flying light specks, FLSs. An FLS is a small (hundreds of micrometers in size) UAV with one or more light sources to generate different colors and textures with adjustable brightness. A synchronized swarm of FLSs renders an illumination in a pre-specified 3D volume, an FLS display. An FLS display provides true depth, enabling a user to perceive a scene more completely by analyzing its illumination from different angles. An FLS display may either be non-immersive or immersive. Both will support 3D acoustics. Non-immersive FLS displays may be the size of a 1980’s computer monitor, enabling a surgical team to observe and control micro robots performing heart surgery inside a patient’s body. Immersive FLS displays may be the size of a room, enabling users to interact with objects, e.g., a rock, a teapot. An object with behavior will be constructed using FLS-matters. FLS-matter will enable a user to touch and manipulate an object, e.g., a user may pick up a teapot or throw a rock. An immersive and interactive FLS display will approximate Star Trek’s holodeck. A successful realization of the research ideas presented in this paper will provide fundamental insights into implementing a holodeck using swarms of FLSs. A holodeck will transform the future of human communication and perception, and how we interact with information and data. It will revolutionize the future of how we work, learn, play and entertain, receive medical care, and socialize.
Citations: 6
Hierarchical Deep Residual Reasoning for Temporal Moment Localization
Pub Date : 2021-10-31 DOI: 10.1145/3469877.3490595
Ziyang Ma, Xianjing Han, Xuemeng Song, Yiran Cui, Liqiang Nie
Temporal Moment Localization (TML) in untrimmed videos is a challenging task in the field of multimedia, which aims at localizing the start and end points of the activity in a video described by a sentence query. Existing methods mainly focus on mining the correlation between video and sentence representations or on how the two modalities are fused. These works mainly understand the video and sentence coarsely, ignoring the fact that a sentence can be understood at various levels of semantics and that the dominant words affecting moment localization are the action and object references. Toward this end, we propose a Hierarchical Deep Residual Reasoning (HDRR) model, which decomposes the video and sentence into multi-level representations with different semantics to achieve finer-grained localization. Furthermore, considering that videos with different resolutions and sentences with different lengths differ in how difficult they are to understand, we design simple yet effective Res-BiGRUs for feature fusion, which are able to grasp useful information in a self-adapting manner. Extensive experiments conducted on the Charades-STA and ActivityNet-Captions datasets demonstrate the superiority of our HDRR model compared with other state-of-the-art methods.
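A minimal sketch of what a Res-BiGRU fusion block might look like is given below, assuming it wraps a bidirectional GRU with a residual connection over the fused clip/word features; the exact layout in HDRR may differ.

```python
import torch
import torch.nn as nn

class ResBiGRU(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # The two GRU directions together return `dim` features per step.
        self.gru = nn.GRU(dim, dim // 2, batch_first=True, bidirectional=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                 # x: (B, T, dim) fused video/query features
        out, _ = self.gru(x)
        return self.norm(x + out)         # residual path preserves the input features

block = ResBiGRU(128)
print(block(torch.randn(2, 20, 128)).shape)   # torch.Size([2, 20, 128])
```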
Citations: 6
Improving Camouflaged Object Detection with the Uncertainty of Pseudo-edge Labels
Pub Date : 2021-10-29 DOI: 10.1145/3469877.3490587
Nobukatsu Kajiura, Hong Liu, S. Satoh
This paper focuses on camouflaged object detection (COD), the task of detecting objects hidden in the background. Most current COD models aim to highlight the target object directly while outputting ambiguous camouflaged boundaries. On the other hand, the performance of models that do consider edge information is not yet satisfactory. To this end, we propose a new framework that makes full use of multiple visual cues, i.e., saliency as well as edges, to refine the predicted camouflaged map. This framework consists of three key components, i.e., a pseudo-edge generator, a pseudo-map generator, and an uncertainty-aware refinement module. In particular, the pseudo-edge generator estimates the boundary and outputs the pseudo-edge label, while a conventional COD method serves as the pseudo-map generator that outputs the pseudo-map label. Then, we propose an uncertainty-based module to reduce the uncertainty and noise of these two pseudo labels, which takes both pseudo labels as input and outputs an edge-accurate camouflaged map. Experiments on various COD datasets demonstrate the effectiveness of our method, with superior performance to existing state-of-the-art methods.
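The refinement idea can be sketched as a small fusion module that predicts a per-pixel confidence from the two pseudo labels and blends them accordingly; the convolutional layers and blending rule below are assumptions chosen to convey the uncertainty-aware weighting, not the paper's exact module.

```python
import torch
import torch.nn as nn

class UncertaintyFusion(nn.Module):
    def __init__(self):
        super().__init__()
        # Predict a per-pixel confidence map from the two pseudo labels themselves.
        self.conf = nn.Sequential(nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, pseudo_map, pseudo_edge):
        """pseudo_map, pseudo_edge: (B, 1, H, W) maps with values in [0, 1]."""
        w = self.conf(torch.cat([pseudo_map, pseudo_edge], dim=1))
        # Where confidence in the region map is low, lean on the edge cue
        # to sharpen the predicted boundary.
        return w * pseudo_map + (1 - w) * pseudo_edge
```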
Citations: 15
Patch-Based Deep Autoencoder for Point Cloud Geometry Compression
Pub Date : 2021-10-18 DOI: 10.1145/3469877.3490611
Kang-Soo You, Pan Gao
The ever-increasing number of 3D applications makes point cloud compression unprecedentedly important and needed. In this paper, we propose a patch-based compression process using deep learning, focusing on lossy point cloud geometry compression. Unlike existing point cloud compression networks, which apply feature extraction and reconstruction to the entire point cloud, we divide the point cloud into patches and compress each patch independently. In the decoding process, we finally assemble the decompressed patches into a complete point cloud. In addition, we train our network with a patch-to-patch criterion, i.e., we use the local reconstruction loss for optimization to approximate global reconstruction optimality. Our method outperforms the state-of-the-art in terms of rate-distortion performance, especially at low bitrates. Moreover, the proposed compression process is guaranteed to generate the same number of points as the input. The network model of this method can be easily applied to other point cloud reconstruction problems, such as upsampling.
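The patch-wise pipeline can be pictured with the sketch below: points are grouped around random seeds, each patch is compressed by a small PointNet-style autoencoder, and training uses a per-patch (local) Chamfer loss. The grouping rule, layer sizes, and fixed output size are illustrative assumptions (the actual method reconstructs the same number of points as the input).

```python
import torch
import torch.nn as nn

def split_into_patches(points, n_patches=16):
    """points: (N, 3). Assign each point to its nearest of `n_patches` random seeds."""
    seeds = points[torch.randperm(points.shape[0])[:n_patches]]   # (P, 3)
    assign = torch.cdist(points, seeds).argmin(dim=1)             # (N,)
    return [points[assign == p] for p in range(n_patches)]

class PatchAutoencoder(nn.Module):
    def __init__(self, latent=64, pts_out=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, latent))
        self.decoder = nn.Sequential(nn.Linear(latent, 256), nn.ReLU(),
                                     nn.Linear(256, pts_out * 3))
        self.pts_out = pts_out

    def forward(self, patch):                              # patch: (M, 3)
        code = self.encoder(patch).max(dim=0).values       # max-pool points into one latent code
        return self.decoder(code).view(self.pts_out, 3)    # decode a reconstructed patch

def chamfer(a, b):
    """Symmetric Chamfer distance between point sets a: (M, 3) and b: (K, 3)."""
    d = torch.cdist(a, b)
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()
```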
Citations: 12
Explore before Moving: A Feasible Path Estimation and Memory Recalling Framework for Embodied Navigation
Pub Date : 2021-10-16 DOI: 10.1145/3469877.3490570
Yang Wu, Shirui Feng, Guanbin Li, Liang Lin
In this paper, we focus on solving the navigation problem of embodied question answering (EmbodiedQA), where the lack of experience and common-sense information essentially results in a failure to find the target when the robot is spawned in unknown environments. We present a route planning method named the Path Estimation and Memory Recalling (PEMR) framework. PEMR includes a "looking ahead" process, i.e., a visual feature extractor module that estimates feasible paths for gathering 3D navigational information, and a "looking behind" process, a memory-recalling mechanism that aims at fully leveraging the past experience collected by the feature extractor. To encourage the navigator to learn more accurate prior expert experience, we improve the original benchmark dataset and provide a family of evaluation metrics for diagnosing both the navigation and question answering modules. We show strong experimental results of PEMR on the EmbodiedQA navigation task.
Citations: 0