{"title":"VDA: Deep Learning based Visual Data Analysis in Integrated Edge to Cloud Computing Environment","authors":"Atanu Mandal, Amir Sinaeepourfard, S. Naskar","doi":"10.1145/3427477.3429781","DOIUrl":null,"url":null,"abstract":"In recent years, video surveillance technology has become pervasive in every sphere. The manual generation of videos’ descriptions requires enormous time and labor, and sometimes essential aspects of videos are overlooked in human summaries. The present work is an attempt towards the automated description generation of Surveillance Video. The proposed method consists of the extraction of key-frames from a surveillance video, objects detection in the key-frames, natural language (English) description generation of the key-frames, and summarizing the descriptions. The key-frames are identified based on a structural similarity index measure. Object detection in a key-frame is performed using the architecture of Single Shot Detection. We used Long Short Term Memory (LSTM) to generate captions from frames. Translation Error Rate (TER) is used to identify and remove duplicate event descriptions. Term frequency-inverse document frequency (TF-IDF) is used to rank the event descriptions generated from a video, and the top-ranked the description is returned as the system generated a summary of the video. We evaluated the Microsoft Video Description Corpus (MSVD) data set to validate our proposed approach, and the system produces a Bilingual Evaluation Understudy (BLEU) score of 46.83.","PeriodicalId":435827,"journal":{"name":"Adjunct Proceedings of the 2021 International Conference on Distributed Computing and Networking","volume":"80 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Adjunct Proceedings of the 2021 International Conference on Distributed Computing and Networking","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3427477.3429781","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 2
Abstract
In recent years, video surveillance technology has become pervasive in every sphere. The manual generation of video descriptions requires enormous time and labor, and essential aspects of videos are sometimes overlooked in human summaries. The present work is an attempt towards automated description generation for surveillance video. The proposed method consists of extracting key-frames from a surveillance video, detecting objects in the key-frames, generating natural language (English) descriptions of the key-frames, and summarizing those descriptions. The key-frames are identified based on the structural similarity index measure (SSIM). Object detection in a key-frame is performed using the Single Shot Detector (SSD) architecture. A Long Short-Term Memory (LSTM) network is used to generate captions from frames. Translation Error Rate (TER) is used to identify and remove duplicate event descriptions. Term frequency-inverse document frequency (TF-IDF) is used to rank the event descriptions generated from a video, and the top-ranked descriptions are returned as the system-generated summary of the video. We evaluated our proposed approach on the Microsoft Video Description (MSVD) corpus, and the system produces a Bilingual Evaluation Understudy (BLEU) score of 46.83.
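To make the pipeline concrete, the following is a minimal Python sketch of two of its stages: SSIM-based key-frame selection and TF-IDF-based ranking of event descriptions. The library choices (OpenCV, scikit-image, scikit-learn), the similarity threshold, and the function names are illustrative assumptions, not the authors' implementation; the SSD detection, LSTM captioning, and TER deduplication stages are omitted.

```python
# Sketch of two stages of the described pipeline: SSIM key-frame
# extraction and TF-IDF ranking of event descriptions. The 0.7
# threshold and library choices are assumptions for illustration,
# not the paper's actual implementation.
import cv2
from skimage.metrics import structural_similarity as ssim
from sklearn.feature_extraction.text import TfidfVectorizer


def extract_keyframes(video_path, threshold=0.7):
    """Keep a frame as a key-frame when its SSIM to the previous
    key-frame drops below the threshold (i.e., the scene changed)."""
    cap = cv2.VideoCapture(video_path)
    keyframes = []
    prev_gray = None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is None or ssim(prev_gray, gray) < threshold:
            keyframes.append(frame)
            prev_gray = gray
    cap.release()
    return keyframes


def rank_descriptions(descriptions, top_k=3):
    """Rank per-frame event descriptions by their mean TF-IDF weight
    and return the top-k as the video summary."""
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(descriptions)
    scores = tfidf.mean(axis=1).A.ravel()  # average weight per description
    ranked = sorted(zip(scores, descriptions), reverse=True)
    return [desc for _, desc in ranked[:top_k]]
```

In the system as described, the descriptions passed to the ranking step would be the LSTM-generated captions remaining after TER-based removal of duplicate event descriptions.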