
Proceedings of the 26th ACM international conference on Multimedia: Latest Publications

Session details: Deep-2 (Recognition)
Pub Date: 2018-10-15, DOI: 10.1145/3286931
Qin Jin
{"title":"Session details: Deep-2 (Recognition)","authors":"Qin Jin","doi":"10.1145/3286931","DOIUrl":"https://doi.org/10.1145/3286931","url":null,"abstract":"","PeriodicalId":339857,"journal":{"name":"Proceedings of the 26th ACM international conference on Multimedia","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124744106","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Knowledge-aware Multimodal Dialogue Systems
Pub Date: 2018-10-15, DOI: 10.1145/3240508.3240605
Lizi Liao, Yunshan Ma, Xiangnan He, Richang Hong, Tat-Seng Chua
By offering a natural way of information seeking, multimodal dialogue systems are attracting increasing attention in domains such as retail and travel. However, most existing dialogue systems are limited to the textual modality, which cannot be easily extended to capture the rich semantics of the visual modality, such as product images. For example, in the fashion domain, the visual appearance of clothes and matching styles play a crucial role in understanding the user's intention. Without considering these, the dialogue agent may fail to generate desirable responses for users. In this paper, we present a Knowledge-aware Multimodal Dialogue (KMD) model to address this limitation of text-based dialogue systems. It gives special consideration to the semantics and domain knowledge revealed in visual content, and features three key components. First, we build a taxonomy-based learning module to capture the fine-grained semantics in images (e.g., the category and attributes of a product). Second, we propose an end-to-end neural conversational model to generate responses based on the conversation history, visual semantics, and domain knowledge. Lastly, to avoid inconsistent dialogues, we adopt a deep reinforcement learning method that accounts for future rewards to optimize the neural conversational model. We perform extensive evaluation on a multi-turn task-oriented dialogue dataset in the fashion domain. Experimental results show that our method significantly outperforms state-of-the-art methods, demonstrating the efficacy of modeling the visual modality and domain knowledge for dialogue systems.
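A minimal sketch of the conditioning step described above, assuming a simple concatenation-based fusion of the dialogue-history encoding, an image feature, and a knowledge vector before decoding; the class name `KMDDecoderSketch`, all dimensions, and the fusion choice are illustrative assumptions rather than the authors' implementation.
```python
# Minimal sketch (not the authors' implementation): fuse the dialogue-history
# encoding, an image feature, and a knowledge vector, then decode a response.
# The class name, all dimensions, and the concatenation fusion are assumptions.
import torch
import torch.nn as nn

class KMDDecoderSketch(nn.Module):
    def __init__(self, vocab_size=10000, hid=256, img_dim=2048, know_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hid)
        self.history_enc = nn.GRU(hid, hid, batch_first=True)   # textual dialogue history
        self.fuse = nn.Linear(hid + img_dim + know_dim, hid)     # multimodal fusion
        self.decoder = nn.GRU(hid, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab_size)

    def forward(self, history_ids, img_feat, know_vec, response_ids):
        _, h = self.history_enc(self.embed(history_ids))         # h: (1, B, hid)
        ctx = torch.tanh(self.fuse(torch.cat([h[-1], img_feat, know_vec], dim=-1)))
        dec_out, _ = self.decoder(self.embed(response_ids), ctx.unsqueeze(0))
        return self.out(dec_out)                                 # per-token vocabulary logits

model = KMDDecoderSketch()
logits = model(torch.randint(0, 10000, (2, 12)),   # dialogue-history token ids
               torch.randn(2, 2048),               # pooled CNN feature of a product image
               torch.randn(2, 128),                # domain-knowledge embedding
               torch.randint(0, 10000, (2, 8)))    # response tokens (teacher forcing)
print(logits.shape)                                # torch.Size([2, 8, 10000])
```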
Citations: 106
Session details: Open Source Software Competition
Pub Date: 2018-10-15, DOI: 10.1145/3286934
Min-Chun Hu
{"title":"Session details: Open Source Software Competition","authors":"Min-Chun Hu","doi":"10.1145/3286934","DOIUrl":"https://doi.org/10.1145/3286934","url":null,"abstract":"","PeriodicalId":339857,"journal":{"name":"Proceedings of the 26th ACM international conference on Multimedia","volume":"104 6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123388071","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Learning to Transfer: Generalizable Attribute Learning with Multitask Neural Model Search
Pub Date: 2018-10-15, DOI: 10.1145/3240508.3240518
Zhi-Qi Cheng, Xiao Wu, Siyu Huang, Jun-Xiu Li, Alexander Hauptmann, Qiang Peng
As attribute learning brings mid-level semantic properties to objects, it can benefit many traditional learning problems in the multimedia and computer vision communities. Given the huge number of attributes, it is extremely challenging to automatically design a neural network that generalizes to other attribute learning tasks. Even for a specific attribute domain, the neural network architecture is usually tuned by a combination of heuristics and grid search over a large space of possible choices. In this paper, the Generalizable Attribute Learning Model (GALM) is proposed to automatically design neural networks for generalizable attribute learning. The main novelty of GALM is that it fully exploits Multi-Task Learning and Reinforcement Learning to speed up the search procedure. With the help of parameter sharing, GALM is able to transfer the pre-searched architecture to different attribute domains. In experiments, we comprehensively evaluate GALM on 251 attributes from three domains: animals, objects, and scenes. Extensive experimental results demonstrate that GALM significantly outperforms state-of-the-art attribute learning approaches and previous neural architecture search methods on two generalizable attribute learning scenarios.
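The reinforcement-learning side of such a search can be illustrated with a minimal REINFORCE controller that samples one operation per layer and is rewarded by the sampled network's validation score; the operation list, the trivial logits-only controller, and the stand-in reward below are assumptions for illustration, not GALM itself.
```python
# Minimal sketch (not the authors' code): a controller samples one op per layer
# and is updated with REINFORCE using a reward that would normally come from
# training the sampled child network with shared weights on the attribute task.
import torch
import torch.nn as nn

OPS = ["conv3x3", "conv5x5", "maxpool", "identity"]

class ControllerSketch(nn.Module):
    def __init__(self, num_layers=4):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_layers, len(OPS)))  # simplest possible controller

    def sample(self):
        dist = torch.distributions.Categorical(logits=self.logits)
        choices = dist.sample()                       # one op index per layer
        return choices, dist.log_prob(choices).sum()

def fake_reward(choices):
    # Stand-in for validation accuracy of the sampled architecture.
    return torch.rand(())

controller = ControllerSketch()
opt = torch.optim.Adam(controller.parameters(), lr=1e-2)
baseline = 0.0
for step in range(50):
    choices, log_prob = controller.sample()
    reward = fake_reward(choices)
    baseline = 0.9 * baseline + 0.1 * reward.item()   # moving-average baseline
    loss = -(reward.item() - baseline) * log_prob     # REINFORCE objective
    opt.zero_grad(); loss.backward(); opt.step()
print([OPS[i] for i in controller.logits.argmax(dim=-1)])
```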
Citations: 18
Hierarchical Memory Modelling for Video Captioning
Pub Date: 2018-10-15, DOI: 10.1145/3240508.3240538
Junbo Wang, Wei Wang, Yan Huang, Liang Wang, T. Tan
Translating videos into natural language sentences has drawn much attention recently. Frameworks that combine visual attention with a Long Short-Term Memory (LSTM) based text decoder have made considerable progress. However, vision-to-language translation remains unsolved because of the semantic gap and the misalignment between video content and the described semantic concepts. In this paper, we propose a Hierarchical Memory Model (HMM), a novel deep video captioning architecture that unifies a textual memory, a visual memory, and an attribute memory in a hierarchical way. These memories guide attention for efficient video representation extraction and semantic attribute selection, and additionally model the long-term dependencies of video sequences and sentences, respectively. Compared with a traditional vision-based text decoder, the proposed attribute-based text decoder can largely reduce the semantic discrepancy between video and sentence. To prove the effectiveness of the proposed model, we perform extensive experiments on two public benchmark datasets, MSVD and MSR-VTT. Experiments show that our model not only discovers appropriate video representations and semantic attributes but also achieves performance comparable or superior to state-of-the-art methods on these datasets.
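As a rough illustration of how a memory can "guide attention", below is a minimal content-based read from a single memory (for example, a visual memory of encoded frames) conditioned on the decoder state; the dot-product scoring, module name, and dimensions are assumptions rather than the paper's architecture.
```python
# Minimal sketch (not the authors' architecture): one content-based read from a
# memory of encoded frames, guided by the current decoder hidden state.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryReadSketch(nn.Module):
    def __init__(self, mem_dim=512, query_dim=512):
        super().__init__()
        self.query_proj = nn.Linear(query_dim, mem_dim)

    def forward(self, memory, decoder_state):
        # memory: (B, slots, mem_dim); decoder_state: (B, query_dim)
        q = self.query_proj(decoder_state).unsqueeze(1)           # (B, 1, mem_dim)
        scores = (memory * q).sum(-1) / memory.size(-1) ** 0.5    # scaled dot-product, (B, slots)
        attn = F.softmax(scores, dim=-1)
        return (attn.unsqueeze(-1) * memory).sum(1), attn         # read vector, attention weights

reader = MemoryReadSketch()
frames = torch.randn(2, 30, 512)       # 30 encoded frames acting as the visual memory
state = torch.randn(2, 512)            # current LSTM decoder state
read_vec, weights = reader(frames, state)
print(read_vec.shape, weights.shape)   # torch.Size([2, 512]) torch.Size([2, 30])
```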
Citations: 17
Session details: Keynote 1
Pub Date: 2018-10-15, DOI: 10.1145/3286916
Susanne CJ Boll
{"title":"Session details: Keynote 1","authors":"Susanne CJ Boll","doi":"10.1145/3286916","DOIUrl":"https://doi.org/10.1145/3286916","url":null,"abstract":"","PeriodicalId":339857,"journal":{"name":"Proceedings of the 26th ACM international conference on Multimedia","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121720138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
From Data to Knowledge: Deep Learning Model Compression, Transmission and Communication
Pub Date: 2018-10-15, DOI: 10.1145/3240508.3240654
Ziqian Chen, Shiqi Wang, D. Wu, Tiejun Huang, Ling-yu Duan
With the advances of artificial intelligence, recent years have witnessed a gradual transition from big data to big knowledge. Based on knowledge-powered deep learning models, big data such as vast collections of text, images, and videos can be analyzed efficiently. As such, in addition to data, communicating the knowledge embodied in deep learning models is also strongly desired. As a specific example of knowledge creation and communication in the context of Knowledge Centric Networking (KCN), we investigate deep learning model compression and demonstrate its promising use through a set of experiments. In particular, towards future KCN, we introduce efficient transmission of deep learning models in terms of both single-model compression and multiple-model prediction. The necessity, importance, and open problems regarding the standardization of deep learning models, which enables interoperability through a standardized compact model representation bitstream syntax, are also discussed.
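For a concrete, if generic, illustration of single-model compression as discussed above, the sketch below applies magnitude pruning followed by symmetric 8-bit quantization to one weight tensor; this is a textbook-style example, not the paper's scheme or any standardized bitstream syntax.
```python
# Generic compression sketch (not the paper's method): magnitude pruning plus
# symmetric 8-bit linear quantization of a single weight tensor before transmission.
import numpy as np

def prune_and_quantize(w, sparsity=0.5):
    w = w.copy()
    thresh = np.quantile(np.abs(w), sparsity)            # magnitude threshold for pruning
    w[np.abs(w) < thresh] = 0.0                          # zero out the smallest weights
    scale = max(float(np.abs(w).max()) / 127.0, 1e-12)   # symmetric 8-bit scale
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale                                      # ship int8 weights plus one float

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 128)).astype(np.float32)
q, scale = prune_and_quantize(w)
w_hat = dequantize(q, scale)
print("size ratio:", q.nbytes / w.nbytes)                # 0.25 before any entropy coding
print("mean abs error:", float(np.mean(np.abs(w - w_hat))))
```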
Citations: 10
VIVID
Pub Date: 2018-10-15, DOI: 10.1145/3240508.3243653
Kuan-Ting Lai, Chia-Chih Lin, Chun-Yao Kang, Mei-Enn Liao, Ming-Syan Chen
Due to advances in deep reinforcement learning and the demand for large amounts of training data, virtual-to-real learning has recently gained much attention from the computer vision community. As state-of-the-art 3D engines can generate photo-realistic images suitable for training deep neural networks, researchers have gradually applied 3D virtual environments to learn different tasks, including autonomous driving, collision avoidance, and image segmentation, to name a few. Although many open-source simulation environments are readily available, most of them either provide small scenes or offer limited interaction with objects in the environment. To facilitate visual recognition learning, we present a new Virtual Environment for Visual Deep Learning (VIVID), which offers large-scale, diversified indoor and outdoor scenes. Moreover, VIVID leverages an advanced human skeleton system, which enables us to simulate numerous complex human actions. VIVID has a wide range of applications and can be used for learning indoor navigation, action recognition, event detection, and more. We also release several deep learning examples in Python to demonstrate the capabilities and advantages of our system.
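VIVID's own Python examples and API are not reproduced here; the sketch below only illustrates the generic agent-environment loop such simulators are typically driven with, using a dummy stand-in environment (`DummyIndoorNavEnv`, its observation format, and its reward are invented for the example).
```python
# Illustrative only: a generic observation/action/reward loop with a dummy
# environment standing in for a 3D simulator. This is NOT VIVID's actual API.
import numpy as np

class DummyIndoorNavEnv:
    """Stand-in for a simulator that returns RGB observations and rewards."""
    def reset(self):
        return np.zeros((84, 84, 3), dtype=np.uint8)
    def step(self, action):
        obs = np.random.randint(0, 256, (84, 84, 3), dtype=np.uint8)
        reward = float(action == 0)            # pretend action 0 moves toward the goal
        done = np.random.rand() < 0.05
        return obs, reward, done, {}

env = DummyIndoorNavEnv()
obs, total = env.reset(), 0.0
for t in range(200):
    action = np.random.randint(4)              # a trained policy would choose here
    obs, reward, done, info = env.step(action)
    total += reward
    if done:
        obs = env.reset()
print("episode return (random policy):", total)
```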
{"title":"VIVID","authors":"Kuan-Ting Lai, Chia-Chih Lin, Chun-Yao Kang, Mei-Enn Liao, Ming-Syan Chen","doi":"10.1145/3240508.3243653","DOIUrl":"https://doi.org/10.1145/3240508.3243653","url":null,"abstract":"Due to the advances in deep reinforcement learning and the demand of large training data, virtual-to-real learning has gained lots of attention from computer vision community recently. As state-of-the-art 3D engines can generate photo-realistic images suitable for training deep neural networks, researchers have been gradually applied 3D virtual environment to learn different tasks including autonomous driving, collision avoidance, and image segmentation, to name a few. Although there are already many open-source simulation environments readily available, most of them either provide small scenes or have limited interactions with objects in the environment. To facilitate visual recognition learning, we present a new Virtual Environment for Visual Deep Learning (VIVID), which offers large-scale diversified indoor and outdoor scenes. Moreover, VIVID leverages the advanced human skeleton system, which enables us to simulate numerous complex human actions. VIVID has a wide range of applications and can be used for learning indoor navigation, action recognition, event detection, etc. We also release several deep learning examples in Python to demonstrate the capabilities and advantages of our system.","PeriodicalId":339857,"journal":{"name":"Proceedings of the 26th ACM international conference on Multimedia","volume":"407 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115855053","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 16
Previewer for Multi-Scale Object Detector
Pub Date: 2018-10-15, DOI: 10.1145/3240508.3240544
Zhihang Fu, Zhongming Jin, Guo-Jun Qi, Chen Shen, Rongxin Jiang, Yao-wu Chen, Xiansheng Hua
Most multi-scale detectors face the challenge of small-size false positives due to the inadequacy of low-level features, which have small receptive fields and weak semantic capabilities. This paper demonstrates that independent predictions from different feature layers on the same region are beneficial for reducing false positives. We propose a novel light-weight previewer block, which previews the objectness probability for the potential regression region of each prior box, using stronger features with larger receptive fields and more contextual information for better predictions. This previewer block is generic and can be easily integrated into multi-scale detectors such as SSD, RFBNet, and MS-CNN. Extensive experiments on the PASCAL VOC and KITTI pedestrian benchmarks show the superiority of the proposed method.
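One way to read the previewer idea is as a per-anchor objectness score computed from a deeper feature map that gates the class scores of the shallower layer; the sketch below implements that reading (channel sizes, the sigmoid gating, and the assumption that the two maps are spatially aligned are illustrative, not the paper's exact block).
```python
# Minimal sketch (not the paper's exact block): predict a per-anchor objectness
# score from a deeper, larger-receptive-field feature map and use it to gate the
# class scores produced at the shallow layer.
import torch
import torch.nn as nn

class PreviewerSketch(nn.Module):
    def __init__(self, deep_ch=512, num_anchors=4):
        super().__init__()
        self.objectness = nn.Conv2d(deep_ch, num_anchors, kernel_size=3, padding=1)

    def forward(self, deep_feat, shallow_cls_scores):
        # deep_feat: (B, deep_ch, H, W); shallow_cls_scores: (B, A, C, H, W)
        obj = torch.sigmoid(self.objectness(deep_feat))      # (B, A, H, W)
        return shallow_cls_scores * obj.unsqueeze(2)         # down-weight likely background anchors

previewer = PreviewerSketch()
deep = torch.randn(1, 512, 38, 38)       # deeper feature map, assumed here to be already
cls = torch.randn(1, 4, 21, 38, 38)      # aligned to the shallow layer's 38x38 grid
print(previewer(deep, cls).shape)        # torch.Size([1, 4, 21, 38, 38])
```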
Citations: 11
Multi-modal Preference Modeling for Product Search
Pub Date: 2018-10-15, DOI: 10.1145/3240508.3240541
Yangyang Guo, Zhiyong Cheng, Liqiang Nie, Xin-Shun Xu, M. Kankanhalli
Users' visual preferences for products have been largely ignored by existing product search methods. In this work, we propose a multi-modal personalized product search method, which aims to retrieve products that are not only relevant to the submitted textual query but also match the user's preferences in both the textual and visual modalities. To achieve this goal, we first leverage the also_view and buy_after_viewing products to construct the visual and textual latent spaces, which are expected to preserve the visual similarity and semantic similarity of products, respectively. We then propose a translation-based search model (TranSearch) to 1) learn a multi-modal latent space based on the pre-trained visual and textual latent spaces; and 2) map users, queries, and products into this space for direct matching. The TranSearch model is trained with a comparative (pairwise) learning strategy, so that the multi-modal latent space is oriented toward personalized ranking during training. Experiments have been conducted on real-world datasets to validate the effectiveness of our method. The results demonstrate that our method outperforms the state-of-the-art method by a large margin.
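A minimal sketch of the translation-plus-pairwise-ranking idea: a user embedding plus an encoded query is translated to a target point in a joint space, products are embedded there from pre-extracted visual and textual features, and training uses a BPR-style pairwise loss (all dimensions, the additive translation, and the MLPs are assumptions, not the authors' model).
```python
# Minimal sketch (not the authors' model): translation-based matching of
# (user, query) pairs against products in a joint latent space, trained with a
# pairwise (BPR-style) ranking loss over positive and negative products.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TranSearchSketch(nn.Module):
    def __init__(self, n_users=1000, q_dim=300, vis_dim=2048, txt_dim=300, d=128):
        super().__init__()
        self.user = nn.Embedding(n_users, d)
        self.query_mlp = nn.Sequential(nn.Linear(q_dim, d), nn.ReLU(), nn.Linear(d, d))
        self.item_mlp = nn.Sequential(nn.Linear(vis_dim + txt_dim, d), nn.ReLU(), nn.Linear(d, d))

    def score(self, user_ids, query_vec, item_vis, item_txt):
        target = self.user(user_ids) + self.query_mlp(query_vec)       # "translate" user by query
        item = self.item_mlp(torch.cat([item_vis, item_txt], dim=-1))  # joint item embedding
        return -((target - item) ** 2).sum(-1)                         # closer means better match

model = TranSearchSketch()
u = torch.randint(0, 1000, (8,))
q = torch.randn(8, 300)                                # e.g., averaged query word vectors
pos = model.score(u, q, torch.randn(8, 2048), torch.randn(8, 300))   # purchased products
neg = model.score(u, q, torch.randn(8, 2048), torch.randn(8, 300))   # sampled negatives
loss = -F.logsigmoid(pos - neg).mean()                 # pairwise ranking (BPR) loss
loss.backward()
print(float(loss))
```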
Citations: 55