{"title":"TransUser's: A Transformer Based Salient Object Detection for Users Experience Generation in 360° Videos","authors":"I. Khan, Kyungjin Han, Jong Weon Lee","doi":"10.1109/AIxVR59861.2024.00042","DOIUrl":null,"url":null,"abstract":"A 360-degree video stream enables users to view their point of interest while giving them the sense of 'being there'. Performing head or hand manipulations to watch the salient objects and sceneries in such a video is a very tiresome task and the user may miss the interesting events. Compared to this, the automatic selection of a user's Point of Interest (PoI) in a 360° video is extremely challenging due to subjective viewpoints and varying degrees of satisfaction. To handle these challenges, we employed an attention-based transformer approach to detect salient objects inside the immersive contents. In the proposed framework, first, an input 360° video is converted into frames where each frame is passed to a CNNbased encoder. The CNN encoder generates feature maps of the input framers. Further, for an attention-based network, we used a stack of three transformers encoder with position embeddings to generate position-awareness embeddings of the encoded feature maps. Each transformer encoder is based on a multihead self-attention block and a multi-layer perceptron with various sets of attention blocks. Finally, encoded features and position embeddings from the transformer encoder are passed through a CNN decoder network to predict the salient object inside the 360° video frames. We evaluated our results on four immersive videos to find the effectiveness of the proposed framework. Further, we also compared our results with state-of-the-art methods where the proposed method outperformed the other existing models.","PeriodicalId":518749,"journal":{"name":"2024 IEEE International Conference on Artificial Intelligence and eXtended and Virtual Reality (AIxVR)","volume":"194 2","pages":"256-260"},"PeriodicalIF":0.0000,"publicationDate":"2024-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2024 IEEE International Conference on Artificial Intelligence and eXtended and Virtual Reality (AIxVR)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/AIxVR59861.2024.00042","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
A 360-degree video stream enables users to view their point of interest while giving them the sense of 'being there'. Performing head or hand manipulations to watch the salient objects and scenery in such a video is a tiresome task, and the user may miss interesting events. At the same time, automatic selection of a user's Point of Interest (PoI) in a 360° video is extremely challenging due to subjective viewpoints and varying degrees of satisfaction. To handle these challenges, we employed an attention-based transformer approach to detect salient objects inside immersive content. In the proposed framework, an input 360° video is first converted into frames, and each frame is passed to a CNN-based encoder that generates feature maps of the input frames. For the attention-based network, we used a stack of three transformer encoders with position embeddings to generate position-aware embeddings of the encoded feature maps. Each transformer encoder is built from a multi-head self-attention block and a multi-layer perceptron, with varying sets of attention blocks. Finally, the encoded features and position embeddings from the transformer encoders are passed through a CNN decoder network to predict the salient objects inside the 360° video frames. We evaluated our results on four immersive videos to assess the effectiveness of the proposed framework, and we compared our results against state-of-the-art methods, where the proposed method outperformed the existing models.
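The abstract describes a CNN encoder, a stack of three transformer encoders with position embeddings, and a CNN decoder that outputs a saliency map. Below is a minimal PyTorch sketch of that pipeline, assuming details the abstract does not give: the layer widths, `embed_dim=256`, `num_heads=8`, a 256x512 input frame (and hence a 32x64 token grid), and the class name `TransUserSketch` are all illustrative choices, not the authors' implementation.

```python
# A minimal sketch of the encoder-transformer-decoder pipeline from the
# abstract. Layer sizes, hyperparameters, and names are assumptions.
import torch
import torch.nn as nn

class TransUserSketch(nn.Module):
    def __init__(self, embed_dim=256, num_heads=8, num_layers=3):
        super().__init__()
        # CNN encoder: downsamples each 360° frame into feature maps.
        self.cnn_encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, embed_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Learned position embeddings over the flattened spatial grid
        # (assumed grid: 32x64 tokens for a 256x512 input frame).
        self.pos_embed = nn.Parameter(torch.zeros(1, 32 * 64, embed_dim))
        # Stack of three transformer encoders, each with multi-head
        # self-attention and a multi-layer perceptron (feed-forward) block.
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads,
            dim_feedforward=4 * embed_dim, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)
        # CNN decoder: upsamples attended features back to a saliency map.
        self.cnn_decoder = nn.Sequential(
            nn.ConvTranspose2d(embed_dim, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, frames):                     # frames: (B, 3, 256, 512)
        f = self.cnn_encoder(frames)               # (B, C, 32, 64)
        b, c, h, w = f.shape
        tokens = f.flatten(2).transpose(1, 2)      # (B, H*W, C)
        tokens = self.transformer(tokens + self.pos_embed)
        f = tokens.transpose(1, 2).reshape(b, c, h, w)
        return self.cnn_decoder(f)                 # (B, 1, 256, 512) saliency

# Usage: per-pixel saliency scores in [0, 1] for a batch of frames.
saliency = TransUserSketch()(torch.randn(2, 3, 256, 512))
```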