Leveraging Linguistically-aware Object Relations and NASNet for Image Captioning

Naeha Sharif, M. Jalwana, Bennamoun, Wei Liu, Syed Afaq Ali Shah

2020 35th International Conference on Image and Vision Computing New Zealand (IVCNZ), 25 November 2020
DOI: 10.1109/IVCNZ51579.2020.9290719
Image captioning is a challenging vision-to-language task that has garnered considerable attention over the past decade. The introduction of encoder-decoder architectures expedited research in this area and provides the backbone of most recent systems. Moreover, leveraging relationships between objects for holistic scene understanding, which in turn improves captioning, has recently sparked interest among researchers. Our proposed model encodes the spatial and semantic proximity of object pairs into linguistically-aware relationship embeddings. It also captures the global semantics of the image using NASNet. In this way, true semantic relations that are not apparent in the visual content of an image can be learned, so that the decoder can attend to the most relevant object relations and visual features to generate more semantically meaningful captions. Our experiments highlight the usefulness of linguistically-aware object relations as well as NASNet visual features for image captioning.
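Below is a minimal, illustrative sketch of the kind of decoding step the abstract describes: object-pair relations are turned into embeddings, and the caption decoder attends over them alongside a pooled global image feature (e.g., from NASNet). This is not the authors' released code; all module names, dimensions, and the exact attention formulation are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationAwareCaptioner(nn.Module):
    """Hypothetical decoder sketch: attends over relationship embeddings
    and fuses them with a global image feature at every decoding step."""
    def __init__(self, vocab_size, num_relations,
                 embed_dim=512, hidden_dim=512, global_dim=4032):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        # "Linguistically-aware" relationship embeddings: one vector per
        # relation label assigned to an object pair (labelling assumed).
        self.rel_embed = nn.Embedding(num_relations, embed_dim)
        self.global_proj = nn.Linear(global_dim, embed_dim)   # pooled NASNet feature
        self.attn_score = nn.Linear(hidden_dim + embed_dim, 1)
        self.lstm = nn.LSTMCell(3 * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, prev_word, rel_labels, global_feat, state):
        """One decoding step.
        prev_word:   (B,)    previous word ids
        rel_labels:  (B, R)  relation label ids for R object pairs
        global_feat: (B, Dg) pooled global image feature
        state:       (h, c)  LSTM state, each (B, H)
        """
        h, c = state
        rels = self.rel_embed(rel_labels)                       # (B, R, D)
        query = h.unsqueeze(1).expand(-1, rels.size(1), -1)     # (B, R, H)
        scores = self.attn_score(torch.cat([query, rels], -1))  # (B, R, 1)
        alpha = F.softmax(scores, dim=1)                        # attention weights
        rel_context = (alpha * rels).sum(dim=1)                 # attended relations
        x = torch.cat([self.word_embed(prev_word),
                       rel_context,
                       self.global_proj(global_feat)], dim=-1)
        h, c = self.lstm(x, (h, c))
        return self.out(h), (h, c)                              # word logits, new state
```

In a full system of this kind, rel_labels would come from a visual relationship detector over object pairs and global_feat from NASNet's pooled activations; the step would be called once per generated word until an end token is produced.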