Indrani Bhattacharya, Arkabandhu Chowdhury, V. Raykar
{"title":"在联合嵌入空间中使用探索-开发范式浏览大型视觉目录的多模态对话","authors":"Indrani Bhattacharya, Arkabandhu Chowdhury, V. Raykar","doi":"10.1145/3323873.3325036","DOIUrl":null,"url":null,"abstract":"We present a multimodal dialog (MMD) system to assist online customers in visually browsing through large catalogs. Visual browsing allows customers to explore products beyond exact search results. We focus on a slightly asymmetric version of a complete MMD system, in that our agent can understand both text and image queries, but responds only in images. We formulate our problem of \"showing the k best images to a user'', based on the dialog context so far, as sampling from a Gaussian Mixture Model (GMM) in a high dimensional joint multimodal embedding space. The joint embedding space is learned by Common Representation Learning and embeds both the text and the image queries. Our system remembers the context of the dialog, and uses an exploration-exploitation paradigm to assist in visual browsing. We train and evaluate the system on an MMD dataset that we synthesize from large catalog data. Our experiments and preliminary human evaluation show that the system is capable of learning and displaying relevant products with an average cosine similarity of 0.85 to the ground truth results, and is capable of engaging human users.","PeriodicalId":149041,"journal":{"name":"Proceedings of the 2019 on International Conference on Multimedia Retrieval","volume":"21 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"Multimodal Dialog for Browsing Large Visual Catalogs using Exploration-Exploitation Paradigm in a Joint Embedding Space\",\"authors\":\"Indrani Bhattacharya, Arkabandhu Chowdhury, V. 
Raykar\",\"doi\":\"10.1145/3323873.3325036\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We present a multimodal dialog (MMD) system to assist online customers in visually browsing through large catalogs. Visual browsing allows customers to explore products beyond exact search results. We focus on a slightly asymmetric version of a complete MMD system, in that our agent can understand both text and image queries, but responds only in images. We formulate our problem of \\\"showing the k best images to a user'', based on the dialog context so far, as sampling from a Gaussian Mixture Model (GMM) in a high dimensional joint multimodal embedding space. The joint embedding space is learned by Common Representation Learning and embeds both the text and the image queries. Our system remembers the context of the dialog, and uses an exploration-exploitation paradigm to assist in visual browsing. We train and evaluate the system on an MMD dataset that we synthesize from large catalog data. Our experiments and preliminary human evaluation show that the system is capable of learning and displaying relevant products with an average cosine similarity of 0.85 to the ground truth results, and is capable of engaging human users.\",\"PeriodicalId\":149041,\"journal\":{\"name\":\"Proceedings of the 2019 on International Conference on Multimedia Retrieval\",\"volume\":\"21 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-01-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2019 on International Conference on Multimedia 
Retrieval\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3323873.3325036\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2019 on International Conference on Multimedia Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3323873.3325036","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Multimodal Dialog for Browsing Large Visual Catalogs using Exploration-Exploitation Paradigm in a Joint Embedding Space
We present a multimodal dialog (MMD) system to assist online customers in visually browsing through large catalogs. Visual browsing allows customers to explore products beyond exact search results. We focus on a slightly asymmetric version of a complete MMD system, in that our agent can understand both text and image queries, but responds only in images. We formulate our problem of "showing the k best images to a user", based on the dialog context so far, as sampling from a Gaussian Mixture Model (GMM) in a high-dimensional joint multimodal embedding space. The joint embedding space is learned by Common Representation Learning and embeds both the text and the image queries. Our system remembers the context of the dialog, and uses an exploration-exploitation paradigm to assist in visual browsing. We train and evaluate the system on an MMD dataset that we synthesize from large catalog data. Our experiments and preliminary human evaluation show that the system is capable of learning and displaying relevant products with an average cosine similarity of 0.85 to the ground truth results, and is capable of engaging human users.
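To make the core formulation concrete, the sketch below illustrates one plausible reading of "showing the k best images" as GMM sampling with an exploration-exploitation mix: an "exploit" component concentrated near the current dialog-context embedding and a broad "explore" component over the catalog. All names, dimensions, and mixture parameters here are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: an 8-D joint embedding space with a small random
# catalog; the real system embeds text and image queries jointly.
DIM, CATALOG_SIZE, K = 8, 500, 5
catalog = rng.normal(size=(CATALOG_SIZE, DIM))  # catalog item embeddings
context = rng.normal(size=DIM)                  # current dialog-context embedding

def sample_gmm(means, scales, weights, n, rng):
    """Draw n samples from a GMM with isotropic per-component covariances."""
    comps = rng.choice(len(means), size=n, p=weights)
    return np.array([rng.normal(means[c], scales[c]) for c in comps])

def top_k_by_cosine(samples, catalog, k):
    """Indices of the k catalog items most cosine-similar to any GMM sample."""
    s = samples / np.linalg.norm(samples, axis=1, keepdims=True)
    c = catalog / np.linalg.norm(catalog, axis=1, keepdims=True)
    best_sim = (s @ c.T).max(axis=0)   # best similarity per catalog item
    return np.argsort(best_sim)[::-1][:k]

# Exploit: tight component around the dialog context.
# Explore: broad component centred on the catalog mean.
means = [context, catalog.mean(axis=0)]
scales = [0.2, 2.0]
weights = [0.8, 0.2]  # exploration-exploitation trade-off (assumed values)

samples = sample_gmm(means, scales, weights, n=50, rng=rng)
shown = top_k_by_cosine(samples, catalog, K)  # the k images shown to the user
print(shown)
```

In a full system the mixture weights and covariances would be updated from dialog feedback, shifting mass between the exploit and explore components as the user's intent sharpens; this sketch fixes them for clarity.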