Joint Event Detection and Description in Continuous Video Streams
Huijuan Xu, Boyang Albert Li, Vasili Ramanishka, L. Sigal, Kate Saenko
DOI: 10.1109/WACV.2019.00048
Dense video captioning involves first localizing events in a video and then generating captions for the identified events. We present the Joint Event Detection and Description Network (JEDDi-Net), which solves this task in an end-to-end fashion: it encodes the input video stream with three-dimensional convolutional layers, proposes variable-length temporal events based on pooled features, and then uses a two-level hierarchical LSTM module with context modeling to transcribe the event proposals into captions. We show the effectiveness of JEDDi-Net on the large-scale ActivityNet Captions dataset.
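As a rough illustration of the two-level hierarchical decoding the abstract describes, the PyTorch sketch below runs an event-level LSTM over proposal features and a word-level LSTM per caption conditioned on that event state; the layer sizes, module names, and the way context is injected are assumptions made for the example, not the JEDDi-Net implementation.

```python
import torch
import torch.nn as nn


class HierarchicalCaptionDecoder(nn.Module):
    """Two-level decoder sketch: an event-level LSTM runs over the sequence of
    event proposals, and a word-level LSTM generates a caption for each proposal
    conditioned on the event-level hidden state (a stand-in for context modeling)."""

    def __init__(self, feat_dim=500, hidden_dim=512, embed_dim=300, vocab_size=10000):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.event_rnn = nn.LSTMCell(feat_dim, hidden_dim)
        self.word_rnn = nn.LSTMCell(embed_dim + hidden_dim, hidden_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, event_feats, captions):
        # event_feats: (num_events, feat_dim) pooled features for each proposal
        # captions:    (num_events, max_len) word ids for teacher forcing
        h_e = event_feats.new_zeros(1, self.hidden_dim)
        c_e = event_feats.new_zeros(1, self.hidden_dim)
        all_logits = []
        for feat, words in zip(event_feats, captions):
            # update the event-level context with this proposal's features
            h_e, c_e = self.event_rnn(feat.unsqueeze(0), (h_e, c_e))
            h_w = event_feats.new_zeros(1, self.hidden_dim)
            c_w = event_feats.new_zeros(1, self.hidden_dim)
            step_logits = []
            for word in words:
                # the word LSTM sees the previous word embedding plus the event context
                inp = torch.cat([self.embed(word.view(1)), h_e], dim=1)
                h_w, c_w = self.word_rnn(inp, (h_w, c_w))
                step_logits.append(self.out(h_w))
            all_logits.append(torch.cat(step_logits, dim=0))
        return all_logits  # one (max_len, vocab_size) logit matrix per event


# toy usage: 3 proposals, captions of length 5
decoder = HierarchicalCaptionDecoder()
feats = torch.randn(3, 500)
caps = torch.randint(0, 10000, (3, 5))
print([l.shape for l in decoder(feats, caps)])
```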
{"title":"Joint Event Detection and Description in Continuous Video Streams","authors":"Huijuan Xu, Boyang Albert Li, Vasili Ramanishka, L. Sigal, Kate Saenko","doi":"10.1109/WACV.2019.00048","DOIUrl":"https://doi.org/10.1109/WACV.2019.00048","url":null,"abstract":"Dense video captioning involves first localizing events in a video and then generating captions for the identified events. We present the Joint Event Detection and Description Network (JEDDi-Net) for solving this task in an end-to-end fashion, which encodes the input video stream with three-dimensional convolutional layers, proposes variable- length temporal events based on pooled features, and then uses a two-level hierarchical LSTM module with context modeling to transcribe the event proposals into captions. We show the effectiveness of our proposed JEDDi-Net on the large-scale ActivityNet Captions dataset.","PeriodicalId":254512,"journal":{"name":"2019 IEEE Winter Applications of Computer Vision Workshops (WACVW)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127532506","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Scalable System Architecture for Activity Detection with Simple Heuristics
Rico Thomanek, Christian Roschke, Benny Platte, R. Manthey, Tony Rolletschke, Manuel Heinzig, M. Vodel, Frank Zimmer, Maximilian Eibl
DOI: 10.1109/WACVW.2019.00012
Analyzing video footage to identify persons at defined locations or to detect complex activities remains a challenging process. Various (semi-)automated systems can be used to address different parts of these challenges, and object detection and classification reach even higher rates when the latest convolutional neural network frameworks are used. Integrated into a scalable infrastructure-as-a-service database system, we combine such networks, running the Detectron framework within Docker containers, with case-specifically engineered tracking and motion-pattern heuristics in order to detect several activities with comparatively low, distributed computing effort and reasonable results.
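As an illustration of what a simple motion-pattern heuristic over tracker output can look like, the sketch below labels a track by its average speed; the track format, activity labels, and thresholds are assumptions made for the example and are not taken from the described system.

```python
# Illustrative motion-pattern heuristic applied to an object track.
# A track is assumed to be a list of (frame_idx, x_center, y_center) tuples
# produced by a detector plus tracker; labels and thresholds are made up.

def classify_track(track, fps=30.0, loiter_speed=5.0, run_speed=120.0):
    """Return a coarse activity label from the average speed in pixels per second."""
    if len(track) < 2:
        return "unknown"
    duration = max(track[-1][0] - track[0][0], 1) / fps     # seconds covered by the track
    path = 0.0
    for (_, xa, ya), (_, xb, yb) in zip(track, track[1:]):
        path += ((xb - xa) ** 2 + (yb - ya) ** 2) ** 0.5    # accumulated path length
    speed = path / duration
    if speed < loiter_speed:
        return "loitering"
    if speed > run_speed:
        return "running"
    return "walking"


if __name__ == "__main__":
    track = [(i, 10 + 2 * i, 50) for i in range(60)]        # 2 seconds of slow horizontal motion
    print(classify_track(track))                            # -> "walking"
```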
{"title":"A Scalable System Architecture for Activity Detection with Simple Heuristics","authors":"Rico Thomanek, Christian Roschke, Benny Platte, R. Manthey, Tony Rolletschke, Manuel Heinzig, M. Vodel, Frank Zimmer, Maximilian Eibl","doi":"10.1109/WACVW.2019.00012","DOIUrl":"https://doi.org/10.1109/WACVW.2019.00012","url":null,"abstract":"The analysis of video footage regarding the identification of persons at defined locations or the detection of complex activities is still a challenging process. Nowadays, various (semi-)automated systems can be used to overcome different parts of these challenges. Object detection and their classification reach even higher detection rates when making use of the latest cutting-edge convolutional neural network frameworks. Integrated into a scalable infrastructure as a service data base system, we employ the combination of such networks by using the Detectron framework within Docker containers with case-specific engineered tracking and motion pattern heuristics in order to detect several activities with comparatively low and distributed computing efforts and reasonable results.","PeriodicalId":254512,"journal":{"name":"2019 IEEE Winter Applications of Computer Vision Workshops (WACVW)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124824462","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fine-grained Action Detection in Untrimmed Surveillance Videos
Sathyanarayanan N. Aakur, Daniel Sawyer, Sudeep Sarkar
DOI: 10.1109/WACVW.2019.00014
Spatiotemporal localization of activities in untrimmed surveillance videos is a hard task, especially given the occurrence of simultaneous activities across different temporal and spatial scales. We tackle this problem with a cascaded region proposal and detection (CRPAD) framework that performs frame-level simultaneous action detection followed by tracking. We propose a frame-level spatial detection model based on advances in object detection, together with a temporal linking algorithm that models the temporal dynamics of the detected activities. We show results on the VIRAT dataset through the recent Activities in Extended Video (ActEV) challenge, part of the TRECVID competition [1, 2].
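The following toy sketch shows one possible form of a temporal linking step, greedily chaining same-class detections across consecutive frames by IoU; the detection format, threshold, and greedy strategy are assumptions for illustration, not the CRPAD algorithm.

```python
# Toy temporal linking: greedily chain per-frame detections into tracks by IoU.
# Detections are assumed to be dicts {"frame": int, "box": (x1, y1, x2, y2), "label": str};
# the 0.3 threshold and greedy matching are illustrative choices only.

def iou(a, b):
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def link_detections(dets, iou_thresh=0.3):
    """Group detections into tracks; each track is a list of linked detections."""
    tracks = []
    for det in sorted(dets, key=lambda d: d["frame"]):
        best, best_iou = None, iou_thresh
        for track in tracks:
            last = track[-1]
            # only extend a track with a same-class detection from the next frame
            if last["label"] == det["label"] and det["frame"] == last["frame"] + 1:
                overlap = iou(last["box"], det["box"])
                if overlap >= best_iou:
                    best, best_iou = track, overlap
        if best is not None:
            best.append(det)
        else:
            tracks.append([det])
    return tracks


dets = [{"frame": f, "box": (10 + f, 10, 60 + f, 80), "label": "person"} for f in range(5)]
print(len(link_detections(dets)))   # 1 track containing all 5 detections
```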
{"title":"Fine-grained Action Detection in Untrimmed Surveillance Videos","authors":"Sathyanarayanan N. Aakur, Daniel Sawyer, Sudeep Sarkar","doi":"10.1109/WACVW.2019.00014","DOIUrl":"https://doi.org/10.1109/WACVW.2019.00014","url":null,"abstract":"Spatiotemporal localization of activities in untrimmed surveillance videos is a hard task, especially given the occurrence of simultaneous activities across different temporal and spatial scales. We tackle this problem using a cascaded region proposal and detection (CRPAD) framework implementing frame-level simultaneous action detection, followed by tracking. We propose the use of a frame-level spatial detection model based on advances in object detection and a temporal linking algorithm that models the temporal dynamics of the detected activities. We show results on the VIRAT dataset through the recent Activities in Extended Video (ActEV) challenge that is part of the TrecVID competition[1, 2].","PeriodicalId":254512,"journal":{"name":"2019 IEEE Winter Applications of Computer Vision Workshops (WACVW)","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133511538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Minding the Gaps in a Video Action Analysis Pipeline
Jia Chen, Jiang Liu, Junwei Liang, Ting-yao Hu, Wei Ke, Wayner Barrios, Dong Huang, Alexander Hauptmann
DOI: 10.1109/WACVW.2019.00015
We present an event detection system that shares many similarities with standard object detection pipelines. It is composed of four modules: feature extraction, event proposal generation, event classification, and event localization. We developed and assessed each module separately, evaluating several candidate options on oracle input with intermediate evaluation metrics. This process creates a mismatch between training and testing once the modules are integrated into the complete system pipeline: each module is trained on clean oracle input, but at test time it receives only system-generated input, which can differ significantly from the oracle data. We found that the gaps between the different modules all contribute to a decrease in accuracy and represent the major bottleneck for a system developed in this way. Fortunately, a set of relatively simple fixes in our final system addresses and mitigates some of these gaps.
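The toy example below (not from the paper) illustrates the kind of oracle-versus-system gap described: a simple nearest-mean classifier trained on clean inputs degrades on systematically biased "system-generated" inputs and recovers when trained on inputs that match the test-time distribution.

```python
# Toy numeric illustration of the oracle-vs-system input gap.
# The bias term mimics a systematic error in upstream module outputs.
import numpy as np

rng = np.random.default_rng(0)


def make_data(n, bias=0.0):
    # two Gaussian classes in 2-D; `bias` shifts all features uniformly
    x = np.concatenate([rng.normal(0, 1, (n, 2)), rng.normal(3, 1, (n, 2))]) + bias
    y = np.array([0] * n + [1] * n)
    return x, y


def fit_means(x, y):
    # nearest-mean classifier: store one mean per class
    return np.stack([x[y == c].mean(axis=0) for c in (0, 1)])


def accuracy(means, x, y):
    pred = ((x[:, None, :] - means[None]) ** 2).sum(-1).argmin(axis=1)
    return float((pred == y).mean())


x_test, y_test = make_data(200, bias=4.0)             # biased "system-generated" test input
oracle_means = fit_means(*make_data(200, bias=0.0))   # module trained on clean oracle input
system_means = fit_means(*make_data(200, bias=4.0))   # module trained on system-like input

print("trained on oracle input :", accuracy(oracle_means, x_test, y_test))  # poor
print("trained on system input :", accuracy(system_means, x_test, y_test))  # recovers
```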
{"title":"Minding the Gaps in a Video Action Analysis Pipeline","authors":"Jia Chen, Jiang Liu, Junwei Liang, Ting-yao Hu, Wei Ke, Wayner Barrios, Dong Huang, Alexander Hauptmann","doi":"10.1109/WACVW.2019.00015","DOIUrl":"https://doi.org/10.1109/WACVW.2019.00015","url":null,"abstract":"We present an event detection system, which shares many similarities with standard object detection pipelines. It is composed of four modules: feature extraction, event proposal generation, event classification and event localization. We developed and assessed each module separately by evaluating several candidate options given oracle input using intermediate evaluation metric. This particular process results in a mismatch gap between training and testing when we integrate the module into the complete system pipeline. This results from the fact that each module is trained on clean oracle input, but during testing the module can only receive system generated input, which can be significantly different from the oracle data. Furthermore, we discovered that all the gaps between the different modules can contribute to a decrease in accuracy and they represent the major bottleneck for a system developed in this way. Fortunately, we were able to develop a set of relatively simple fixes in our final system to address and mitigate some of the gaps.","PeriodicalId":254512,"journal":{"name":"2019 IEEE Winter Applications of Computer Vision Workshops (WACVW)","volume":"193 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124297774","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep Representation Learning for Metadata Verification
Bor-Chun Chen, L. Davis
DOI: 10.1109/WACVW.2019.00019
Verifying the authenticity of a given image is an emerging topic in media forensics research. Many current works focus on content manipulation detection, which aims to detect possible alterations in the image content. However, tampering can occur not only in the image content itself but also in the metadata associated with the image, such as the timestamp, geo-tag, and captions. We address metadata verification, aiming to verify the authenticity of the metadata associated with an image, using a deep representation learning approach. We propose a deep neural network called the Attentive Bilinear Convolutional Neural Network (AB-CNN) that learns appropriate representations for metadata verification. AB-CNN addresses several common challenges in verifying a specific type of metadata, namely events (i.e., times and places): lack of training data, fine-grained differences between distinct events, and diverse visual content within the same event. Experimental results on three different datasets show that the proposed model provides a substantial improvement over the baseline method.
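As a rough sketch of how attention and bilinear pooling can be combined for image-metadata verification, the PyTorch module below attends over image locations with a metadata query and scores the bilinear interaction between the pooled image feature and the metadata embedding; the backbone, dimensions, and wiring are guesses for illustration, not the AB-CNN architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentiveBilinearVerifier(nn.Module):
    """Sketch of attention plus bilinear pooling for image/metadata verification."""

    def __init__(self, img_dim=64, meta_vocab=1000, meta_dim=64):
        super().__init__()
        self.backbone = nn.Sequential(                        # tiny stand-in CNN backbone
            nn.Conv2d(3, img_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(img_dim, img_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.meta_embed = nn.Embedding(meta_vocab, meta_dim)  # e.g. an event / place id
        self.attn = nn.Linear(meta_dim, img_dim)
        self.score = nn.Linear(img_dim * meta_dim, 1)

    def forward(self, image, meta_id):
        feat = self.backbone(image)                           # (B, C, H, W)
        b, c, h, w = feat.shape
        feat = feat.view(b, c, h * w)                         # (B, C, HW)
        meta = self.meta_embed(meta_id)                       # (B, D)
        # metadata-conditioned attention over spatial locations
        query = self.attn(meta).unsqueeze(1)                  # (B, 1, C)
        weights = F.softmax(torch.bmm(query, feat), dim=-1)   # (B, 1, HW)
        pooled = torch.bmm(feat, weights.transpose(1, 2)).squeeze(-1)              # (B, C)
        # bilinear interaction between attended image feature and metadata embedding
        bilinear = torch.bmm(pooled.unsqueeze(2), meta.unsqueeze(1)).flatten(1)    # (B, C*D)
        return self.score(bilinear).squeeze(-1)               # verification logit


model = AttentiveBilinearVerifier()
imgs = torch.randn(2, 3, 64, 64)
meta = torch.randint(0, 1000, (2,))
print(model(imgs, meta).shape)   # torch.Size([2])
```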
{"title":"Deep Representation Learning for Metadata Verification","authors":"Bor-Chun Chen, L. Davis","doi":"10.1109/WACVW.2019.00019","DOIUrl":"https://doi.org/10.1109/WACVW.2019.00019","url":null,"abstract":"Verifying the authenticity of a given image is an emerging topic in media forensics research. Many current works focus on content manipulation detection, which aims to detect possible alteration in the image content. However, tampering might not only occur in the image content itself, but also in the metadata associated with the image, such as timestamp, geo-tag, and captions. We address metadata verification, aiming to verify the authenticity of the metadata associated with the image, using a deep representation learning approach. We propose a deep neural network called Attentive Bilinear Convolutional Neural Networks (AB-CNN) that learns appropriate representation for metadata verification. AB-CNN address several common challenges in verifying a specific type of metadata – event (i.e. time and places), including lack of training data, finegrained differences between distinct events, and diverse visual content within the same event. Experimental results on three different datasets show that the proposed model can provide a substantial improvement over the baseline method.","PeriodicalId":254512,"journal":{"name":"2019 IEEE Winter Applications of Computer Vision Workshops (WACVW)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125873343","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}