{"title":"Dense Captioning of Videos using Feature Context Integrated Deep LSTM with Local Attention","authors":"J. Jacob, V. P. Devassia","doi":"10.1109/I-SMAC55078.2022.9987416","DOIUrl":null,"url":null,"abstract":"Dense captioning is a fast emerging area in video processing in natural language, that construe semantic contents present in an input video and. A traditional deep learning algorithm faces more challenges in solving this problem because it requires optimizing not just one set of values, but two sets, namely (1) event proposals, which are the timestamps for detecting an activity in a particular temporal region, and (2) natural language annotations for the detected proposals. Bidirectional LS TMs are used to predict event proposals based on information from the past and future of the event. Captions for detected events are also generated based on the past and future information associated with the event. The context vectors are augmented with original C3D video features in the decoder network in order to optimize the encoder network for proposals instead of captions. In this way, all the information necessary for the decoding network is provided. A local attention mechanism is added to the model so that it can focus on the relevant parts of the data to improve its performance. As a final step, captions will be generated with deep LSTMs. In order to verify the effectiveness of proposed model, a rigorous experiments have been conducted on the suggested innovations and demonstrated that it is remarkably effective at dense captioning events in videos with significant gains across a variety of metrics when it uses Feature Context Integrated (FC1) Deep LS TM with local attention.","PeriodicalId":306129,"journal":{"name":"2022 Sixth International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC)","volume":"28 5 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 Sixth International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/I-SMAC55078.2022.9987416","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Dense video captioning is a fast-emerging area at the intersection of video processing and natural language generation that interprets the semantic content of an input video. Traditional deep learning algorithms face particular challenges in solving this problem because it requires optimizing not one set of values but two: (1) event proposals, the timestamps that localize an activity within a particular temporal region, and (2) natural language annotations for the detected proposals. Bidirectional LSTMs are used to predict event proposals based on information from both the past and the future of the event, and captions for the detected events are likewise generated from the past and future information associated with each event. The context vectors are augmented with the original C3D video features in the decoder network, so that the encoder network can be optimized for proposals rather than captions while the decoder still receives all the information it needs. A local attention mechanism is added to the model so that it can focus on the relevant parts of the data, further improving performance. As a final step, captions are generated with deep LSTMs. To verify the effectiveness of the proposed model, rigorous experiments were conducted on the suggested innovations, demonstrating that the Feature Context Integrated (FCI) Deep LSTM with local attention is remarkably effective at densely captioning events in videos, with significant gains across a variety of metrics.
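The pipeline the abstract describes (a bidirectional LSTM proposal module over C3D features, a local attention mechanism, and a deep LSTM decoder whose input concatenates the attended context vector with the original C3D features) can be sketched as follows. This is a minimal, hypothetical PyTorch reconstruction based only on the abstract, not the authors' implementation: all module names, feature dimensions, and the window size are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProposalEncoder(nn.Module):
    """Bidirectional LSTM over C3D clip features; scores each timestep
    as a potential event proposal using past and future context."""
    def __init__(self, feat_dim=500, hidden=512):  # dims are assumptions
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True,
                           bidirectional=True)
        self.score = nn.Linear(2 * hidden, 1)

    def forward(self, c3d):                      # c3d: (B, T, feat_dim)
        states, _ = self.rnn(c3d)                # (B, T, 2*hidden)
        proposal_logits = self.score(states).squeeze(-1)  # (B, T)
        return states, proposal_logits

class LocalAttention(nn.Module):
    """Attends over a local window of encoder states around the current
    position, rather than the whole sequence (window size assumed)."""
    def __init__(self, enc_dim, dec_dim, window=5):
        super().__init__()
        self.window = window
        self.proj = nn.Linear(dec_dim, enc_dim)

    def forward(self, enc_states, dec_state, center):
        # enc_states: (B, T, enc_dim); dec_state: (B, dec_dim)
        lo = max(0, center - self.window)
        hi = min(enc_states.size(1), center + self.window + 1)
        local = enc_states[:, lo:hi]                      # (B, W, enc_dim)
        scores = torch.bmm(local, self.proj(dec_state).unsqueeze(-1))
        weights = F.softmax(scores.squeeze(-1), dim=-1)   # (B, W)
        context = (weights.unsqueeze(-1) * local).sum(1)  # (B, enc_dim)
        return context

class CaptionDecoder(nn.Module):
    """Deep (stacked) LSTM decoder; at every step the attended context
    vector is concatenated with the pooled original C3D features,
    mirroring the feature-context integration described above."""
    def __init__(self, vocab, embed=300, enc_dim=1024,
                 feat_dim=500, hidden=512, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, embed)
        self.attn = LocalAttention(enc_dim, hidden)
        self.rnn = nn.LSTM(embed + enc_dim + feat_dim, hidden,
                           num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def step(self, word, enc_states, c3d_mean, state, t):
        # Previous top-layer hidden state (zeros at the first step).
        h = state[0][-1] if state is not None else \
            enc_states.new_zeros(word.size(0), self.rnn.hidden_size)
        ctx = self.attn(enc_states, h, center=t)
        x = torch.cat([self.embed(word), ctx, c3d_mean], dim=-1).unsqueeze(1)
        out, state = self.rnn(x, state)
        return self.out(out.squeeze(1)), state
```

At inference time, one would threshold `proposal_logits` to select event segments, mean-pool the C3D features of each segment into `c3d_mean`, and decode a caption token by token with `CaptionDecoder.step`; the exact proposal selection and training objectives are not specified in the abstract.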