Towards semantic visual representation: augmenting image representation with natural language descriptors
Konda Reddy Mopuri, R. Venkatesh Babu
Proceedings. Indian Conference on Computer Vision, Graphics & Image Processing, pages 64:1-64:8, published 2016-12-18
DOI: 10.1145/3009977.3010010 (https://doi.org/10.1145/3009977.3010010)
Citations: 2
Abstract
Learning image representations is an interesting and challenging problem. When users upload images to photo-sharing websites, they often provide multiple textual tags for ease of reference. These tags can reveal significant information about the content of the image, such as the objects present or the action taking place. Approaches have been proposed to extract additional information from these tags in order to augment the visual cues and build a multi-modal image representation. However, existing approaches pay little attention to the semantic meaning of the tags when encoding them. In this work, we attempt to enrich the image representation with tag encodings that leverage their semantics. Our approach utilizes neural-network-based natural language descriptors to represent the tag information. By complementing the visual features learned by convnets, our approach yields an efficient multi-modal image representation. Experimental evaluation suggests that our approach produces a better multi-modal image representation by exploiting the two data modalities for classification on benchmark datasets.
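The fusion the abstract describes can be illustrated with a minimal sketch: encode the tags with a word-embedding table (standing in for the neural language descriptors), pool them into one semantic vector, and concatenate that with the convnet visual feature. The embedding table, dimensions, and `fuse` helper below are all hypothetical stand-ins, not the paper's actual pipeline.

```python
import numpy as np

# Hypothetical word-embedding table (tag -> vector); in practice these
# would come from a pretrained model such as word2vec or GloVe.
EMBEDDINGS = {
    "dog":   np.array([0.9, 0.1, 0.0]),
    "beach": np.array([0.1, 0.8, 0.2]),
    "run":   np.array([0.2, 0.1, 0.9]),
}
EMB_DIM = 3

def encode_tags(tags):
    """Average the embeddings of the known tags into one semantic tag vector."""
    vecs = [EMBEDDINGS[t] for t in tags if t in EMBEDDINGS]
    if not vecs:
        return np.zeros(EMB_DIM)  # no usable tags: fall back to a zero vector
    return np.mean(vecs, axis=0)

def l2_normalize(v):
    """Scale a vector to unit L2 norm (leave zero vectors unchanged)."""
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def fuse(visual_feat, tags):
    """Normalize each modality separately, then concatenate into one vector."""
    tag_feat = encode_tags(tags)
    return np.concatenate([l2_normalize(visual_feat), l2_normalize(tag_feat)])

# Toy 4-d visual feature standing in for a convnet descriptor.
visual = np.array([0.5, 0.5, 0.5, 0.5])
rep = fuse(visual, ["dog", "beach"])
print(rep.shape)  # -> (7,): 4 visual dims + 3 tag-embedding dims
```

Normalizing each modality before concatenation keeps one modality's scale from dominating the joint representation; the fused vector can then be fed to any standard classifier.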