
Proceedings of the 2nd ACM International Conference on Multimedia in Asia: Latest Publications

Interactive re-ranking for cross-modal retrieval based on object-wise question answering
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446290
Rintaro Yanagi, Ren Togo, Takahiro Ogawa, M. Haseyama
Cross-modal retrieval methods retrieve desired images from a query text by learning relationships between texts and images. Because queries are easy to prepare, this retrieval approach is one of the most practical. Recent cross-modal retrieval is convenient and accurate when users input a query text that uniquely identifies the desired image. However, users frequently input ambiguous query texts, which make it difficult to obtain the desired images. To alleviate this difficulty, in this paper we propose a novel interactive cross-modal retrieval method based on question answering (QA) with users. The proposed method analyses candidate images and asks users about information that can effectively narrow the retrieval candidates. By simply answering the questions generated by the proposed method, users can reach their desired images even from an ambiguous query text. Experimental results show the effectiveness of the proposed method.
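The abstract describes narrowing candidates through object-wise yes/no questions. The following is a minimal sketch of that idea only, not the authors' implementation: the function names, the informativeness heuristic, and the assumption that an off-the-shelf detector supplies object labels per image are all hypothetical.

```python
# Hedged sketch: re-rank retrieval candidates by asking the user about the
# object whose presence best splits the remaining candidate set.
from collections import Counter

def rerank_by_qa(candidates, detected_objects, ask_user, max_rounds=3):
    """candidates: image ids ranked by text-image similarity.
    detected_objects: dict image_id -> set of object labels (assumed to come
    from an off-the-shelf detector). ask_user: callback returning True/False
    for a question such as 'Does the desired image contain a dog?'."""
    remaining = list(candidates)
    for _ in range(max_rounds):
        if len(remaining) <= 1:
            break
        counts = Counter(obj for img in remaining for obj in detected_objects[img])
        if not counts:
            break
        # Most informative yes/no question: the object splitting the set most evenly.
        best_obj = min(counts, key=lambda o: abs(counts[o] - len(remaining) / 2))
        answer = ask_user(f"Does the desired image contain a {best_obj}?")
        remaining = [img for img in remaining
                     if (best_obj in detected_objects[img]) == answer]
    return remaining
```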
Citations: 4
Unsupervised learning of co-occurrences for face images retrieval
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446265
Thomas Petit, Pierre Letessier, S. Duffner, Christophe Garcia
Despite a huge leap in the performance of face recognition systems in recent years, some cases remain challenging for them while being trivial for humans. This is because the human brain exploits much more information than facial appearance to identify a person. In this work, we aim at capturing the social context of unlabeled observed faces in order to improve face retrieval. In particular, we propose a framework that substantially improves face retrieval by exploiting the faces occurring simultaneously in a query's context to infer a multi-dimensional social context descriptor. Combining this compact structural descriptor with the individual visual face features in a common feature vector considerably increases the correct face retrieval rate and makes it possible to disambiguate a large proportion of query results of different persons that are barely distinguishable visually. To evaluate our framework, we also introduce a new large dataset of faces of French TV personalities organised in TV shows in order to capture the co-occurrence relations between people. On this dataset, our framework improves the mean Average Precision over a set of internal queries from 67.93% (using only facial features extracted with a state-of-the-art pre-trained model) to 78.16% (using both facial features and face co-occurrences), and from 67.88% to 77.36% over a set of external queries.
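A small sketch of the fusion step the abstract outlines: concatenating a visual face embedding with a context descriptor built from co-occurring identities and retrieving by cosine similarity. The averaging scheme and all function names are assumptions, not the paper's exact formulation.

```python
# Illustrative sketch (assumed formulation) of fusing face features with a
# social-context descriptor derived from co-occurring identities.
import numpy as np

def context_descriptor(cooccurring_ids, id_embeddings, dim):
    """Average the embeddings of identities seen in the same show/segment.
    id_embeddings: dict id -> np.ndarray of shape (dim,)."""
    if not cooccurring_ids:
        return np.zeros(dim)
    return np.mean([id_embeddings[i] for i in cooccurring_ids], axis=0)

def fused_feature(face_feat, cooccurring_ids, id_embeddings):
    ctx = context_descriptor(cooccurring_ids, id_embeddings, face_feat.shape[0])
    v = np.concatenate([face_feat, ctx])          # common feature vector
    return v / (np.linalg.norm(v) + 1e-8)

def retrieve(query_vec, gallery_vecs, top_k=10):
    sims = gallery_vecs @ query_vec               # cosine similarity (unit vectors)
    return np.argsort(-sims)[:top_k]
```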
Citations: 1
Patch assembly for real-time instance segmentation
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446281
Yutao Xu, Hanli Wang, Jian Zhu
The sliding-window paradigm has proven effective for visual instance segmentation in many popular research works. However, it still suffers from the bottleneck of inference time. To accelerate existing instance segmentation approaches based on dense sliding windows, this work introduces a novel approach, called patch assembly, which can be integrated into bounding box detectors for segmentation without extra up-sampling computations. A well-designed detector named PAMask is proposed to verify the effectiveness of the proposed approach. Benefiting from its simple structure as well as a fusion of multiple representations, PAMask runs in real time while achieving competitive performance. Besides, another effective technique called Center-NMS is designed to reduce the number of boxes involved in intersection-over-union calculation; it can be fully parallelized on device and contributes a 0.6% mAP improvement in both detection and segmentation for free.
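The abstract does not spell out Center-NMS, so the following is only one possible reading of "reducing the number of boxes for IoU calculation": pre-filter candidates by keeping the top-scoring box per spatial cell before running standard IoU suppression. The cell size, the pre-filter rule, and the function names are assumptions and should not be taken as the paper's algorithm.

```python
# Hedged sketch, in the spirit of (but not necessarily identical to) Center-NMS:
# keep one top-scoring box per grid cell, then apply standard IoU suppression.
import numpy as np

def iou(a, b):
    x1, y1 = np.maximum(a[:2], b[:2])
    x2, y2 = np.minimum(a[2:4], b[2:4])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-8)

def center_prefilter(boxes, scores, cell=16):
    best = {}
    for i, b in enumerate(boxes):
        cx, cy = (b[0] + b[2]) / 2 // cell, (b[1] + b[3]) / 2 // cell
        if (cx, cy) not in best or scores[i] > scores[best[(cx, cy)]]:
            best[(cx, cy)] = i                    # one candidate per cell
    return sorted(best.values(), key=lambda i: -scores[i])

def nms(boxes, scores, iou_thr=0.5, cell=16):
    keep, order = [], center_prefilter(boxes, scores, cell)
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < iou_thr]
    return keep
```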
Citations: 0
Graph-based variational auto-encoder for generalized zero-shot learning
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446283
Jiwei Wei, Yang Yang, Xing Xu, Yanli Ji, Xiaofeng Zhu, Heng Tao Shen
Zero-shot learning has been a highlighted research topic in both the vision and language areas. Recently, generative methods have emerged as a new trend in zero-shot learning, which synthesize samples of unseen categories via generative models. However, the lack of fine-grained information in the synthesized samples makes it difficult to improve classification accuracy. It is also time-consuming and inefficient to synthesize samples and use them to train classifiers. To address these issues, we propose a novel graph-based variational auto-encoder for zero-shot learning. Specifically, we adopt a knowledge graph to model the explicit inter-class relationships, and design a full graph convolution auto-encoder framework to generate the classifier from the distribution of the class-level semantic features on individual nodes. The encoder learns the latent representations of individual nodes, and the decoder generates the classifiers from these latent representations. In contrast to synthesizing samples, our proposed method directly generates classifiers from the distribution of the class-level semantic features for both seen and unseen categories, which is more straightforward, accurate and computationally efficient. We conduct extensive experiments and evaluate our method on the widely used large-scale ImageNet-21K dataset. Experimental results validate the efficacy of the proposed approach.
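A minimal sketch of the graph auto-encoder pattern the abstract describes: graph convolutions over class nodes map semantic features to per-class classifier weights. Layer sizes, the normalized adjacency `a_hat`, and class names are assumptions; this is not the authors' code.

```python
# Minimal sketch (assumed shapes): a graph auto-encoder that maps class-level
# semantic features on a knowledge graph to per-class classifier weights.
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, a_hat):              # a_hat: normalized adjacency (N, N)
        return torch.relu(self.lin(a_hat @ x))

class GraphClassifierGenerator(nn.Module):
    def __init__(self, sem_dim=300, hid_dim=512, cls_dim=2048):
        super().__init__()
        self.encoder = GCNLayer(sem_dim, hid_dim)   # latent node representations
        self.decoder = GCNLayer(hid_dim, cls_dim)   # per-class classifier weights

    def forward(self, sem, a_hat):
        z = self.encoder(sem, a_hat)
        w = self.decoder(z, a_hat)                  # (N_classes, cls_dim)
        return nn.functional.normalize(w, dim=1)

# Classification for seen and unseen classes: scores = image_feature @ w.T
```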
Citations: 1
An automated method with anchor-free detection and U-shaped segmentation for nuclei instance segmentation
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446258
X. Feng, Lijuan Duan, Jie Chen
Nuclei segmentation plays an important role in cancer diagnosis. Automated methods for digital pathology have become popular due to the development of deep learning and neural networks. However, this task still faces challenges. Most current techniques cannot be applied directly because nuclei in such images are densely clustered and numerous. Moreover, anchor-based object detection methods lead to a huge amount of computation, which is even worse on pathological images with a large target density. To address these issues, we propose a novel network with anchor-free detection and U-shaped segmentation. An altered feature enhancement module is attached to improve performance on dense target detection. Meanwhile, the U-shaped structure in the segmentation block ensures the aggregation of features of different dimensions generated by the backbone network. We evaluate our work on a Multi-Organ Nuclei Segmentation dataset from the MICCAI 2018 challenge. In comparison with other methods, our proposed method achieves state-of-the-art performance.
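To make the combination of an anchor-free head and a U-shaped decoder concrete, here is a deliberately tiny sketch under stated assumptions: two encoder stages, one skip connection, a center-heatmap head for anchor-free detection and a per-pixel mask head. The channel widths and exact layout are illustrative only, not the paper's architecture.

```python
# Rough sketch (assumed architecture details): anchor-free center heatmap plus
# a small U-shaped encoder-decoder with one skip connection.
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class TinyUShape(nn.Module):
    def __init__(self, cin=3, c=32):
        super().__init__()
        self.enc1 = conv_block(cin, c)
        self.enc2 = conv_block(c, 2 * c)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.dec = conv_block(3 * c, c)            # concat of upsampled + skip features
        self.center_head = nn.Conv2d(c, 1, 1)      # anchor-free nucleus-center heatmap
        self.mask_head = nn.Conv2d(c, 1, 1)        # per-pixel nucleus mask

    def forward(self, x):                           # x: (B, 3, H, W), H and W even
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        d = self.dec(torch.cat([self.up(e2), e1], dim=1))
        return torch.sigmoid(self.center_head(d)), torch.sigmoid(self.mask_head(d))
```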
Citations: 2
Integrating aspect-aware interactive attention and emotional position-aware for multi-aspect sentiment analysis
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446315
Xiaoye Wang, Xiaowen Zhou, Zan Gao, Peng Yang, Xianbin Wen, Hongyun Ning
Aspect-level sentiment analysis is a fine-grained sentiment analysis task that aims to infer the sentiment polarity associated with each aspect in an opinion sentence. Attention-based neural networks have proven effective at extracting aspect terms, but prior models are context-dependent. Moreover, prior works only attend to aspect terms to detect sentiment words and cannot account for sentiment words that are influenced by domain-specific knowledge. In this work, we propose a novel module integrating aspect-aware interactive attention and emotional position awareness for multi-aspect sentiment analysis (abbreviated as AIAEP), in which aspect-aware interactive attention is used to extract aspect terms; it fuses the domain-specific information of an aspect and its context and learns their relationship representations with global and local context attention mechanisms. Specifically, a syntactic parse is applied to the sentiment lexicon to add prior domain knowledge. We then propose a novel position-aware fusion scheme to compose aspect-sentiment pairs. It combines the absolute and relative distances between aspect terms and sentiment words, which improves the accuracy of polarity classification. Extensive experimental results on the SemEval2014 Task 4 restaurant and AIChallenge2018 datasets demonstrate that AIAEP outperforms state-of-the-art approaches and is very effective for aspect-level sentiment analysis.
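The position-aware fusion is described only as combining absolute and relative distances between aspect terms and sentiment words. The sketch below is my reading of that sentence, with a made-up decay formula and mixing weight `alpha`; the paper's actual scheme may differ.

```python
# Illustrative sketch (assumed formula): weight each context token by a mix of
# its absolute and normalized relative distance to the aspect span.
import numpy as np

def position_weights(n_tokens, aspect_start, aspect_end, alpha=0.5):
    idx = np.arange(n_tokens)
    # token distance to the nearest token of the aspect span (0 inside the span)
    abs_dist = np.maximum.reduce([aspect_start - idx, idx - aspect_end,
                                  np.zeros(n_tokens, dtype=int)])
    rel_dist = abs_dist / max(n_tokens - 1, 1)        # normalized to [0, 1]
    # 1 at the aspect itself, decaying with distance under both measures
    return alpha * (1 - abs_dist / n_tokens) + (1 - alpha) * (1 - rel_dist)

def weighted_context(token_embeddings, weights):
    """token_embeddings: (n_tokens, dim). Rescale word features by position."""
    return token_embeddings * weights[:, None]
```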
Citations: 0
A multi-scale language embedding network for proposal-free referring expression comprehension
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446279
Taijin Zhao, Hongliang Li, Heqian Qiu, Q. Wu, K. Ngan
Referring expression comprehension (REC) is a task that aims to find the location of an object specified by a language expression. Current solutions for REC can be classified into proposal-based methods and proposal-free methods. Proposal-free methods have recently become popular because of their flexibility and lightness. Nevertheless, existing proposal-free works give little consideration to visual context. As REC is a context-sensitive task, it is hard for current proposal-free methods to comprehend expressions that describe objects by their position relative to surrounding things. In this paper, we propose a multi-scale language embedding network for REC. Our method adopts the proposal-free structure, which directly feeds fused visual-language features into a detection head to predict the bounding box of the target. In the fusion process, we propose a grid fusion module and a grid-context fusion module to compute the similarity between language features and visual features in regions of different sizes. Meanwhile, we additionally add fully interacted vision-language information and position information to strengthen the feature fusion. This novel fusion strategy helps utilize context flexibly, so the network can handle varied expressions, especially expressions that describe objects by the things around them. Our proposed method outperforms state-of-the-art methods on the RefCOCO, RefCOCO+ and RefCOCOg datasets.
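A compact sketch of the multi-scale similarity idea behind grid fusion: pool the visual feature map to several grid resolutions and score each cell against the sentence embedding. The pooling scales and function names are assumptions, not the paper's modules.

```python
# Minimal sketch (assumed tensor shapes): language-vision cosine similarity
# computed over visual grids of several region sizes.
import torch
import torch.nn.functional as F

def grid_fusion(visual_feat, lang_feat, scales=(1, 2, 4)):
    """visual_feat: (C, H, W) feature map; lang_feat: (C,) sentence embedding.
    Returns one (s, s) similarity map per pooling scale."""
    sims = []
    for s in scales:
        pooled = F.adaptive_avg_pool2d(visual_feat.unsqueeze(0), (s, s))[0]  # (C, s, s)
        v = F.normalize(pooled.flatten(1), dim=0)     # (C, s*s), unit columns
        q = F.normalize(lang_feat, dim=0)             # (C,)
        sims.append((q @ v).view(s, s))               # cosine similarity per cell
    return sims
```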
Citations: 1
Incremental multi-view object detection from a moving camera
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446257
T. Konno, Ayako Amma, Asako Kanezaki
Object detection in a single image is a challenging problem due to clutter, occlusions, and a large variety of viewing locations. This task can benefit from integrating multi-frame information captured by a moving camera. In this paper, we propose a method to incrementally accumulate object detection scores extracted from multiple frames captured from different viewpoints. For each frame, we run an efficient end-to-end object detector that outputs object bounding boxes, each of which is associated with category and pose scores. The scores of detected objects are then stored in grid locations in 3D space. After observing multiple frames, the object scores stored in each grid location are integrated based on the best object pose hypothesis. This strategy requires consistency of object categories and poses across multiple frames, and thus significantly suppresses missed detections. The performance of the proposed method is evaluated on our newly created multi-class object dataset captured in robot simulation and real environments, as well as on a public benchmark dataset.
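A simplified sketch of the accumulation step described above: per-category detection scores from each frame are added into a 3D voxel grid. It assumes detections are already back-projected to world coordinates and omits the pose-hypothesis integration; the class and method names are illustrative.

```python
# Simplified sketch (not the authors' exact update rule): accumulate per-category
# detection scores from successive frames into a 3D score grid.
import numpy as np

class ScoreGrid:
    def __init__(self, grid_shape, n_categories, origin, voxel_size):
        self.scores = np.zeros((*grid_shape, n_categories))
        self.origin = np.asarray(origin, dtype=float)
        self.voxel_size = voxel_size

    def to_voxel(self, xyz):
        return tuple(((np.asarray(xyz) - self.origin) / self.voxel_size).astype(int))

    def add_frame(self, detections):
        """detections: list of (world_xyz, category_id, score) for one frame."""
        for xyz, cat, score in detections:
            i, j, k = self.to_voxel(xyz)
            if all(0 <= v < s for v, s in zip((i, j, k), self.scores.shape[:3])):
                self.scores[i, j, k, cat] += score   # multi-view accumulation

    def decide(self, min_score=1.0):
        """Voxels whose best accumulated category score exceeds a threshold."""
        best = self.scores.max(axis=-1)
        return np.argwhere(best > min_score), self.scores.argmax(axis=-1)
```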
Citations: 3
Semantic feature augmentation for fine-grained visual categorization with few-sample training
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446264
Xiang Guan, Yang Yang, Zheng Wang, Jingjing Li
Small data challenges have emerged in many learning problems, since the success of deep neural networks often relies on the availability of huge amounts of labeled data that are expensive to collect. We explore a highly challenging task, few-sample training, which uses a small number of labeled images of each category and corresponding textual descriptions to train a model for fine-grained visual categorization. To tackle overfitting caused by small data, in this paper we propose two novel feature augmentation approaches, Semantic Gate Feature Augmentation (SGFA) and Semantic Boundary Feature Augmentation (SBFA). Instead of generating new image instances, we propose to directly synthesize instance features by leveraging semantic information. The main novelties are: (1) SGFA reduces overfitting on small data by adding random noise to different regions of an image's feature maps through a gating mechanism. (2) SBFA optimizes the decision boundary of the classifier: the decision boundary in feature space is estimated with the assistance of semantic information, and feature augmentation is then performed by sampling in this region. Experiments on fine-grained visual categorization benchmarks demonstrate that our proposed approach significantly improves categorization performance.
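The SGFA idea, as stated in the abstract, is gated noise injection into feature-map regions. The sketch below shows that mechanism in its simplest form; the Bernoulli spatial gate, the noise scale, and the function name are assumptions rather than the paper's exact design (which drives the gate with semantic information).

```python
# Hedged sketch of gated feature-map noise augmentation in the spirit of SGFA.
import torch

def semantic_gate_feature_augment(feat, gate_prob=0.3, noise_std=0.1):
    """feat: (B, C, H, W) feature maps. A random spatial gate selects positions;
    Gaussian noise is added only where the gate is open."""
    gate = (torch.rand(feat.size(0), 1, feat.size(2), feat.size(3),
                       device=feat.device) < gate_prob).float()
    noise = torch.randn_like(feat) * noise_std
    return feat + gate * noise        # ungated regions keep their original features
```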
Citations: 1
Hungry networks: 3D mesh reconstruction of a dish and a plate from a single dish image for estimating food volume
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446275
Shu Naritomi, Keiji Yanai
Dietary calorie management has been an important topic in recent years, and various methods and applications for image-based food calorie estimation have been published in the multimedia community. Most existing methods for estimating food calorie amounts use 2D image recognition. In this paper, by contrast, we infer the 3D volume for more accurate estimation. We perform 3D reconstruction of a dish (food and plate) and a plate (without food) from a single image. We succeeded in restoring the 3D shape with high accuracy while maintaining consistency between the plate part of the estimated 3D dish and the estimated 3D plate. To achieve this, the following contributions were made in this paper. (1) Proposal of "Hungry Networks," a new network that generates two kinds of 3D volumes from a single image. (2) Introduction of a plate consistency loss that matches the shapes of the plate parts of the two reconstructed models. (3) Creation of a new dataset of 3D food models obtained by 3D scanning actual foods and plates. We also conducted an experiment to infer the volume of only the food region from the difference between the two reconstructed volumes. The results show that the introduced loss function not only matches the 3D shape of the plate but also contributes to obtaining the volume with higher accuracy. Although some existing studies consider the 3D shapes of foods, this is the first study to generate a 3D mesh volume from a single dish image.
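The final volume-differencing step is well defined once both meshes are reconstructed: compute each enclosed volume (divergence theorem over signed tetrahedra) and subtract the plate from the dish. The sketch covers only this step, not the reconstruction networks, and the array layout is an assumption.

```python
# Worked sketch of the volume step only: food volume as dish volume minus plate
# volume, for watertight triangle meshes given as (vertices, faces) arrays.
import numpy as np

def mesh_volume(vertices, faces):
    """vertices: (V, 3) float array; faces: (F, 3) vertex-index array."""
    v0, v1, v2 = (vertices[faces[:, i]] for i in range(3))
    # signed volume of the tetrahedron formed by each triangle and the origin
    signed = np.einsum("ij,ij->i", v0, np.cross(v1, v2)) / 6.0
    return abs(signed.sum())

def food_volume(dish_mesh, plate_mesh):
    """Each argument is a (vertices, faces) tuple from a reconstructed model."""
    return mesh_volume(*dish_mesh) - mesh_volume(*plate_mesh)
```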
Citations: 5