Fariba Lotfi , Mansour Jamzad , Hamid Beigy , Helia Farhood , Quan Z. Sheng , Amin Beheshti
{"title":"Knowledge graph construction in hyperbolic space for automatic image annotation","authors":"Fariba Lotfi , Mansour Jamzad , Hamid Beigy , Helia Farhood , Quan Z. Sheng , Amin Beheshti","doi":"10.1016/j.imavis.2024.105293","DOIUrl":null,"url":null,"abstract":"<div><div>Automatic image annotation (AIA) is a fundamental and challenging task in computer vision. Considering the correlations between tags can lead to more accurate image understanding, benefiting various applications, including image retrieval and visual search. While many attempts have been made to incorporate tag correlations in annotation models, the method of constructing a knowledge graph based on external knowledge sources and hyperbolic space has not been explored. In this paper, we create an attributed knowledge graph based on vocabulary, integrate external knowledge sources such as WordNet, and utilize hyperbolic word embeddings for the tag representations. These embeddings provide a sophisticated tag representation that captures hierarchical and complex correlations more effectively, enhancing the image annotation results. In addition, leveraging external knowledge sources enhances contextuality and significantly enriches existing AIA datasets. We exploit two deep learning-based models, the Relational Graph Convolutional Network (R-GCN) and the Vision Transformer (ViT), to extract the input features. We apply two R-GCN operations to obtain word descriptors and fuse them with the extracted visual features. We evaluate the proposed approach using three public benchmark datasets. Our experimental results demonstrate that the proposed architecture achieves state-of-the-art performance across most metrics on Corel5k, ESP Game, and IAPRTC-12.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"151 ","pages":"Article 105293"},"PeriodicalIF":4.2000,"publicationDate":"2024-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Image and Vision Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0262885624003986","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
Abstract
Automatic image annotation (AIA) is a fundamental and challenging task in computer vision. Considering the correlations between tags can lead to more accurate image understanding, benefiting various applications, including image retrieval and visual search. While many attempts have been made to incorporate tag correlations in annotation models, the method of constructing a knowledge graph based on external knowledge sources and hyperbolic space has not been explored. In this paper, we create an attributed knowledge graph based on vocabulary, integrate external knowledge sources such as WordNet, and utilize hyperbolic word embeddings for the tag representations. These embeddings provide a sophisticated tag representation that captures hierarchical and complex correlations more effectively, enhancing the image annotation results. In addition, leveraging external knowledge sources enhances contextuality and significantly enriches existing AIA datasets. We exploit two deep learning-based models, the Relational Graph Convolutional Network (R-GCN) and the Vision Transformer (ViT), to extract the input features. We apply two R-GCN operations to obtain word descriptors and fuse them with the extracted visual features. We evaluate the proposed approach using three public benchmark datasets. Our experimental results demonstrate that the proposed architecture achieves state-of-the-art performance across most metrics on Corel5k, ESP Game, and IAPRTC-12.
期刊介绍:
Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.