Matt Brown, Keith Fieldhouse, E. Swears, Paul Tunison, Adam Romlein, A. Hoogs
We introduce a self-contained, mobile surveillance system designed to remotely detect and track people in real time, at long ranges, and over a wide field of view in cluttered urban and natural settings. The system is integrated with an unmanned ground vehicle, which hosts an array of four IR and four high-resolution RGB cameras, navigational sensors, and onboard processing computers. High-confidence, low-false-alarm-rate person tracks are produced by fusing motion detections and single-frame CNN person detections between co-registered RGB and IR video streams. Processing speeds are increased by using semantic scene segmentation and a tiered inference scheme to focus processing on the most salient regions of the 43° x 7.8° composite field of view. The system autonomously produces alerts of human presence and movement within the field of view, which are disseminated over a radio network and remotely viewed on a tablet computer. We present an ablation study quantifying the benefits that multi-sensor, multi-detector fusion brings to the problem of detecting people in challenging outdoor environments with shadows, occlusions, clutter, and variable weather conditions.
{"title":"Multi-Modal Detection Fusion on a Mobile UGV for Wide-Area, Long-Range Surveillance","authors":"Matt Brown, Keith Fieldhouse, E. Swears, Paul Tunison, Adam Romlein, A. Hoogs","doi":"10.1109/WACV.2019.00207","DOIUrl":"https://doi.org/10.1109/WACV.2019.00207","url":null,"abstract":"We introduce a self-contained, mobile surveillance system designed to remotely detect and track people in real time, at long ranges, and over a wide field of view in cluttered urban and natural settings. The system is integrated with an unmanned ground vehicle, which hosts an array of four IR and four high-resolution RGB cameras, navigational sensors, and onboard processing computers. High-confidence, low-false-alarm-rate person tracks are produced by fusing motion detections and single-frame CNN person detections between co-registered RGB and IR video streams. Processing speeds are increased by using semantic scene segmentation and a tiered inference scheme to focus processing on the most salient regions of the 43° x 7.8° composite field of view. The system autonomously produces alerts of human presence and movement within the field of view, which are disseminated over a radio network and remotely viewed on a tablet computer. We present an ablation study quantifying the benefits that multi-sensor, multi-detector fusion brings to the problem of detecting people in challenging outdoor environments with shadows, occlusions, clutter, and variable weather conditions.","PeriodicalId":436637,"journal":{"name":"2019 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"202 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125732000","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Puneet Mathur, Meghna P. Ayyar, R. Shah, S. Sharma
Identification of diseased kidney glomeruli and fibrotic regions remains subjective and time-consuming due to complete dependence on an expert kidney pathologist. In an attempt to automate the classification of glomeruli into normal and abnormal morphology, and the classification of fibrosis patches into mild, moderate, and severe categories, we investigate three deep learning techniques: traditional transfer learning, pre-trained deep neural networks for feature extraction followed by supervised classification, and a novel Multi-Gaze Attention Network (MGANet) that uses multi-headed self-attention through parallel residual skip connections in a CNN architecture. Empirically, while transfer learning models such as ResNet50, InceptionResNetV2, VGG19 and InceptionV3 acutely underperform on the classification tasks, a Logistic Regression model augmented with features extracted from InceptionResNetV2 shows promising results. Additionally, the experiments establish that the proposed MGANet architecture outperforms both of the former baseline techniques, achieving state-of-the-art accuracies of 87.25% and 81.47% for glomeruli and fibrosis classification, respectively, on the Renal Glomeruli Fibrosis Histopathological (RGFH) database.
{"title":"Exploring Classification of Histological Disease Biomarkers From Renal Biopsy Images","authors":"Puneet Mathur, Meghna P. Ayyar, R. Shah, S. Sharma","doi":"10.1109/WACV.2019.00016","DOIUrl":"https://doi.org/10.1109/WACV.2019.00016","url":null,"abstract":"Identification of diseased kidney glomeruli and fibrotic regions remains subjective and time-consuming due to complete dependence on an expert kidney pathologist. In an attempt to automate the classification of glomeruli into normal and abnormal morphology and classification of fibrosis patches into mild, moderate and severe categories, we investigate three deep learning techniques: traditional transfer learning, pre-trained deep neural networks for feature extraction followed by supervised classification, and a novel Multi-Gaze Attention Network (MGANet) that uses multi-headed self-attention through parallel residual skip connections in a CNN architecture. Emperically, while the transfer learning models such as ResNet50, InceptionResNetV2, VGG19 and InceptionV3 acutely under-perform in the classification tasks, the Logistic Regression model augmented with features extracted from the InceptionResNetV2 shows promising results. Additionally, the experiments effectively ascertain that the proposed MGANet architecture outperforms both the former baseline techniques to establish the state of the art accuracy of 87.25% and 81.47% for glomerluli and fibrosis classification, respectively on the Renal Glomeruli Fibrosis Histopathological (RGFH) database.","PeriodicalId":436637,"journal":{"name":"2019 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125953356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Talha Ahmad Siddiqui, Samarth Bharadwaj, S. Kalyanaraman
Ahead-of-time forecasting of the incident solar irradiance on a panel is indicative of expected energy yield and is essential for efficient grid distribution and planning. Traditionally, these forecasts are based on meteorological physics models whose parameters are tuned with coarse-grained radiometric tiles sensed from geo-satellites. This research presents a novel application of deep neural networks to observing and estimating short-term weather effects from videos. Specifically, we use time-lapsed videos (sky-videos) obtained from upward-facing wide-lensed cameras (sky-cameras) to directly estimate and forecast solar irradiance. We introduce and present results on two large publicly available datasets obtained from weather stations in two regions of North America using relatively inexpensive optical hardware. These datasets contain over a million images that span 1 and 12 years, respectively, the largest such collection to our knowledge. Compared to satellite-based approaches, the proposed deep learning approach significantly reduces the normalized mean-absolute-percentage error for both nowcasting, i.e., prediction of the solar irradiance at the instant the frame is captured, and forecasting, i.e., ahead-of-time irradiance prediction for durations of up to 4 hours.
{"title":"A Deep Learning Approach to Solar-Irradiance Forecasting in Sky-Videos","authors":"Talha Ahmad Siddiqui, Samarth Bharadwaj, S. Kalyanaraman","doi":"10.1109/WACV.2019.00234","DOIUrl":"https://doi.org/10.1109/WACV.2019.00234","url":null,"abstract":"Ahead-of-time forecasting of incident solar-irradiance on a panel is indicative of expected energy yield and is essential for efficient grid distribution and planning. Traditionally, these forecasts are based on meteorological physics models whose parameters are tuned by coarse-grained radiometric tiles sensed from geo-satellites. This research presents a novel application of deep neural network approach to observe and estimate short-term weather effects from videos. Specifically, we use time-lapsed videos (sky-videos) obtained from upward facing wide-lensed cameras (sky-cameras) to directly estimate and forecast solar irradiance. We introduce and present results on two large publicly available datasets obtained from weather stations in two regions of North America using relatively inexpensive optical hardware. These datasets contain over a million images that span for 1 and 12 years respectively, the largest such collection to our knowledge. Compared to satellite based approaches, the proposed deep learning approach significantly reduces the normalized mean-absolute-percentage error for both nowcasting, i.e. prediction of the solar irradiance at the instance the frame is captured, as well as forecasting, ahead-of-time irradiance prediction for a duration for upto 4 hours.","PeriodicalId":436637,"journal":{"name":"2019 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130046565","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent progress in model-free single object tracking (SOT) algorithms has largely inspired applying SOT to multi-object tracking (MOT) to improve robustness and to relieve the dependency on an external detector. However, SOT algorithms are generally designed to distinguish a target from its environment, and hence run into problems when a target is spatially mixed with similar objects, as is observed frequently in MOT. To address this issue, in this paper we propose an instance-aware tracker that integrates SOT techniques into MOT by encoding awareness both within and between target models. In particular, we construct each target model by fusing information for distinguishing the target both from the background and from other instances (tracking targets). To preserve the uniqueness of all target models, our instance-aware tracker considers response maps from all target models and assigns spatial locations exclusively to optimize the overall accuracy. Another contribution we make is a dynamic model refreshing strategy learned by a convolutional neural network. This strategy helps to eliminate initialization noise and to adapt to variation in target size and appearance. To show the effectiveness of the proposed approach, we evaluate it on the popular MOT15 and MOT16 challenge benchmarks. On both benchmarks, our approach achieves the best overall performance in comparison with published results.
{"title":"Online Multi-Object Tracking With Instance-Aware Tracker and Dynamic Model Refreshment","authors":"Peng Chu, Heng Fan, C. C. Tan, Haibin Ling","doi":"10.1109/WACV.2019.00023","DOIUrl":"https://doi.org/10.1109/WACV.2019.00023","url":null,"abstract":"Recent progresses in model-free single object tracking (SOT) algorithms have largely inspired applying SOT to multi-object tracking (MOT) to improve the robustness as well as relieving dependency on external detector. However, SOT algorithms are generally designed for distinguishing a target from its environment, and hence meet problems when a target is spatially mixed with similar objects as observed frequently in MOT. To address this issue, in this paper we propose an instance-aware tracker to integrate SOT techniques for MOT by encoding awareness both within and between target models. In particular, we construct each target model by fusing information for distinguishing target both from background and other instances (tracking targets). To conserve uniqueness of all target models, our instance-aware tracker considers response maps from all target models and assigns spatial locations exclusively to optimize the overall accuracy. Another contribution we make is a dynamic model refreshing strategy learned by a convolutional neural network. This strategy helps to eliminate initialization noise as well as to adapt to variation of target size and appearance. To show the effectiveness of the proposed approach, it is evaluated on the popular MOT15 and MOT16 challenge benchmarks. On both benchmarks, our approach achieves the best overall performances in comparison with published results.","PeriodicalId":436637,"journal":{"name":"2019 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"148 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132449987","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent research in pixel-wise semantic segmentation has increasingly focused on the development of very complicated deep neural networks, which require a large amount of computational resources. The ability to perform dense predictions in real time therefore becomes as important as achieving high accuracy. This real-time demand turns out to be fundamental, particularly on mobile platforms and other GPU-powered embedded systems such as the NVIDIA Jetson TX series. In this paper, we present a fast and efficient lightweight network called the Turbo Unified Network (ThunderNet). With a minimal backbone truncated from ResNet18, ThunderNet unifies the pyramid pooling module with our customized decoder. Our experimental results show that ThunderNet achieves 64.0% mIoU on Cityscapes, with real-time performance of 96.2 fps on a Titan XP GPU (512x1024 input) and 20.9 fps on a Jetson TX2 (256x512 input).
{"title":"ThunderNet: A Turbo Unified Network for Real-Time Semantic Segmentation","authors":"Wei Xiang, Hongda Mao, V. Athitsos","doi":"10.1109/WACV.2019.00195","DOIUrl":"https://doi.org/10.1109/WACV.2019.00195","url":null,"abstract":"Recent research in pixel-wise semantic segmentation has increasingly focused on the development of very complicated deep neural networks, which require a large amount of computational resources. The ability to perform dense predictions in real-time, therefore, becomes tantamount to achieving high accuracies. This real-time demand turns out to be fundamental particularly on the mobile platform and other GPU-powered embedded systems like NVIDIA Jetson TX series. In this paper, we present a fast and efficient lightweight network called Turbo Unified Network (ThunderNet). With a minimum backbone truncated from ResNet18, ThunderNet unifies the pyramid pooling module with our customized decoder. Our experimental results show that ThunderNet can achieve 64.0% mIoU on CityScapes, with real-time performance of 96.2 fps on a Titan XP GPU (512x1024), and 20.9 fps on Jetson TX2 (256x512).","PeriodicalId":436637,"journal":{"name":"2019 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131720453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Vishal Kaushal, Rishabh K. Iyer, S. Kothawade, Rohan Mahadev, Khoshrav Doctor, Ganesh Ramakrishnan
State-of-the-art computer vision techniques based on supervised machine learning are, in general, data hungry. Curating their data poses the challenges of expensive human labeling, inadequate computing resources, and long experiment turnaround times. Training-data subset selection and active learning techniques have been proposed as possible solutions to these challenges. A special class of subset selection functions naturally models notions of diversity, coverage, and representation, and can be used to eliminate redundancy, thus lending itself well to training-data subset selection. These functions can also improve the efficiency of active learning by further reducing human labeling effort: they select a subset of the examples obtained with conventional uncertainty-sampling-based techniques. In this work, we empirically demonstrate the effectiveness of two diversity models, namely the Facility-Location and Dispersion models, for training-data subset selection and reducing labeling effort. We demonstrate this across the board for a variety of computer vision tasks, including Gender Recognition, Face Recognition, Scene Recognition, Object Detection and Object Recognition. Our results show that diversity-based subset selection, done the right way, can increase accuracy by up to 5-10% over existing baselines, particularly in settings where less training data is available. This allows complex machine learning models such as Convolutional Neural Networks to be trained with much less training data and lower labeling costs while incurring minimal performance loss.
{"title":"Learning From Less Data: A Unified Data Subset Selection and Active Learning Framework for Computer Vision","authors":"Vishal Kaushal, Rishabh K. Iyer, S. Kothawade, Rohan Mahadev, Khoshrav Doctor, Ganesh Ramakrishnan","doi":"10.1109/WACV.2019.00142","DOIUrl":"https://doi.org/10.1109/WACV.2019.00142","url":null,"abstract":"Supervised machine learning based state-of-the-art computer vision techniques are in general data hungry. Their data curation poses the challenges of expensive human labeling, inadequate computing resources and larger experiment turn around times. Training data subset selection and active learning techniques have been proposed as possible solutions to these challenges. A special class of subset selection functions naturally model notions of diversity, coverage and representation and can be used to eliminate redundancy thus lending themselves well for training data subset selection. They can also help improve the efficiency of active learning in further reducing human labeling efforts by selecting a subset of the examples obtained using the conventional uncertainty sampling based techniques. In this work, we empirically demonstrate the effectiveness of two diversity models, namely the Facility-Location and Dispersion models for training-data subset selection and reducing labeling effort. We demonstrate this across the board for a variety of computer vision tasks including Gender Recognition, Face Recognition, Scene Recognition, Object Detection and Object Recognition. Our results show that diversity based subset selection done in the right way can increase the accuracy by upto 5 - 10% over existing baselines, particularly in settings in which less training data is available. This allows the training of complex machine learning models like Convolutional Neural Networks with much less training data and labeling costs while incurring minimal performance loss.","PeriodicalId":436637,"journal":{"name":"2019 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121712239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
T. Tsesmelis, Irtiza Hasan, M. Cristani, A. D. Bue, Fabio Galasso
Lighting design in indoor environments is of primary importance for at least two reasons: 1) people should perceive adequate light; 2) an effective lighting design means consistent energy saving. We present the Invisible Light Switch (ILS) to address both aspects. ILS dynamically adjusts the room illumination level to save energy while keeping the users' perceived light level constant, so the energy saving is invisible to them. Our proposed ILS leverages a radiosity model to estimate the light level perceived by a person within an indoor environment, taking into account the person's position and viewing frustum (head pose). ILS may therefore dim luminaires that are not seen by the user, resulting in effective energy saving, especially in large open offices (where the lights might otherwise be on everywhere for a single person). To quantify the system's performance, we have collected a new dataset in which people wear luxmeter devices while working in office rooms. The luxmeters measure the amount of light (in lux) reaching the people's gaze, which we consider a proxy for their perceived illumination level. Our initial results are promising: in a room with 8 LED luminaires, the daily energy consumption may be reduced from 18585 to 6206 watts with ILS (which currently needs 1560 watts to operate). In doing so, the perceived lighting drops by just 200 lux, a value considered negligible when the original illumination level is above 1200 lux, as is normally the case in offices.
{"title":"Human-Centric Light Sensing and Estimation From RGBD Images: The Invisible Light Switch","authors":"T. Tsesmelis, Irtiza Hasan, M. Cristani, A. D. Bue, Fabio Galasso","doi":"10.1109/WACV.2019.00050","DOIUrl":"https://doi.org/10.1109/WACV.2019.00050","url":null,"abstract":"Lighting design in indoor environments is of primary importance for at least two reasons: 1) people should perceive an adequate light; 2) an effective lighting design means consistent energy saving. We present the Invisible Light Switch (ILS) to address both aspects. ILS dynamically adjusts the room illumination level to save energy while maintaining constant the light level perception of the users. So the energy saving is invisible to them. Our proposed ILS leverages a radiosity model to estimate the light level which is perceived by a person within an indoor environment, taking into account the person position and her/his viewing frustum (head pose). ILS may therefore dim those luminaires, which are not seen by the user, resulting in an effective energy saving, especially in large open offices (where light may otherwise be ON everywhere for a single person). To quantify the system performance, we have collected a new dataset where people wear luxmeter devices while working in office rooms. The luxmeters measure the amount of light (in Lux) reaching the people gaze, which we consider a proxy to their illumination level perception. Our initial results are promising: in a room with 8 LED luminaires, the energy consumption in a day may be reduced from 18585 to 6206 watts with ILS (currently needing 1560 watts for operations). While doing so, the drop in perceived lighting decreases by just 200 lux, a value considered negligible when the original illumination level is above 1200 lux, as is normally the case in offices.","PeriodicalId":436637,"journal":{"name":"2019 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123876306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper considers the problem of video inpainting, i.e., removing specified objects from an input video. Many methods have been developed for this problem so far, but they exhibit a trade-off between image quality and computational time: no existing method can generate high-quality images at video rate. The key to video inpainting is how to establish correspondences from scene regions occluded in one frame to those observed in other frames. To break the trade-off, we propose to use CNNs as a solution to this key problem. We extend existing CNNs for the standard task of optical flow estimation so that they can estimate the flow of occluded background regions. The extension includes augmentation of their architecture and changes to their training method. We experimentally show that this approach works well despite its simplicity, and that a simple video inpainting method integrating this flow estimator runs at video rate (e.g., 32 fps for 832 × 448 pixel videos on a standard PC with a GPU) while achieving image quality close to the state of the art.
{"title":"Video-Rate Video Inpainting","authors":"Rito Murase, Yan Zhang, Takayuki Okatani","doi":"10.1109/WACV.2019.00170","DOIUrl":"https://doi.org/10.1109/WACV.2019.00170","url":null,"abstract":"This paper considers the problem of video inpainting, i.e., to remove specified objects from an input video. Many methods have been developed for the problem so far, in which there is a trade-off between image quality and computational time. There was no method that can generate high-quality images in video rate. The key to video inpainting is how to establish correspondences from scene regions occluded in a frame to those observed in other frames. To break the trade-off, we propose to use CNNs as a solution to this key problem. We extend existing CNNs for the standard task of optical flow estimation to be able to estimate the flow of occluded background regions. The extension includes augmentation of their architecture and changes of their training method. We experimentally show that this approach works well despite its simplicity, and that a simple video inpainting method integrating this flow estimator runs in video rate (e.g., 32fps for 832 × 448 pixel videos on a standard PC with a GPU) while achieving image quality close to the state-of-the-art.","PeriodicalId":436637,"journal":{"name":"2019 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126961090","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In cardiac interventions, localization of the guiding catheter tip in 2D fluoroscopic images is important for specifying vessel branches and calibrating vessels with stenosis. While detection of the guiding catheter tip is not trivial in contrast-free images, due to low-dose radiation as well as occlusion by other devices, it is even more challenging in contrast-filled images. As contrast-filled vessels become visible in X-ray imaging, the guiding catheter tip landmark can often be completely occluded by the contrast medium. It is difficult even for human eyes to precisely localize the catheter tip from a single angiography image; physicians have to rely on information from before the injection of contrast medium to localize the occluded tip. Automatic landmark detection under occlusion is therefore important and can significantly simplify the intervention workflow. To address this problem, we propose a novel Cascade Attention Machine (CAM) model. It borrows the idea of how human experts localize the catheter tip: first performing landmark detection when occlusion does not happen, then leveraging this information as prior knowledge to assist detection under occlusion. Attention maps computed from the non-occluded detection refine the heatmaps for the occluded detection, guiding inference to focus on the relevant regions. Experiments on X-ray angiography demonstrate promising performance compared with state-of-the-art baselines, showing that CAM can capture the relation between situations with and without occlusion to achieve precise detection of the occluded landmark.
{"title":"Cascade Attention Machine for Occluded Landmark Detection in 2D X-Ray Angiography","authors":"Liheng Zhang, V. Singh, Guo-Jun Qi, Terrence Chen","doi":"10.1109/WACV.2019.00017","DOIUrl":"https://doi.org/10.1109/WACV.2019.00017","url":null,"abstract":"In cardiac interventions, localization of guiding catheter tip in 2D fluoroscopic images is important to specify ves-sel branches and calibrate vessels with stenosis. While detection of guiding catheter tip is not trivial in contrast-free images due to low dose radiation as well as occlusion by other devices, it is even more challenging in contrast-filled images. As contrast-filled vessels become visible in X-ray imaging, the landmark of guiding catheter tip can often be completely occluded by the contrast medium. It is difficult even for human eyes to precisely localize the catheter tip from a single angiography image. Physicians have to rely on information before the inject of contrast medium to localize the guiding catheter tip occluded by contrast medium. Automatic landmark detection when occlusion happens is important and can significantly simplify the intervention workflow. To address this problem, we propose a novel Cascade Attention Machine (CAM) model. It borrows the idea of how human experts localize the catheter tip by first per-forming landmark detection when occlusion does not hap-pen, then leveraging this information as prior knowledge to assist the occluded detection. Attention maps are computed from non-occluded detection to further refine the heatmaps for occluded detection to guide the inference focusing on related regions. Experiments on X-ray angiography demonstrate the promising performance compared with the state-of-the-art baselines. It shows that the CAM can capture the relation between situations with and without occlusion to achieve precise detection of occluded landmark.","PeriodicalId":436637,"journal":{"name":"2019 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126964853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Harshala Gammulle, Tharindu Fernando, S. Denman, S. Sridharan, C. Fookes
We propose a novel conditional GAN (cGAN) model for continuous fine-grained human action segmentation that utilises multi-modal data and learned scene context information. The proposed approach utilises two GANs, termed the Action GAN and the Auxiliary GAN, where the Action GAN is trained to operate over the current RGB frame while the Auxiliary GAN utilises supplementary information such as depth or optical flow. The goal of both GANs is to generate similar 'action codes', a vector representation of the current action. To facilitate this process, a context extractor that incorporates data and recent outputs from both modes is used to extract context information that aids recognition performance. The result is a recurrent GAN architecture which learns a task-specific loss function from multiple feature modalities. Extensive evaluations of variants of the proposed model show the importance of utilising different streams of information, such as context and auxiliary information, in the proposed network, and show that our model is capable of outperforming state-of-the-art methods on three widely used datasets: 50 Salads, MERL Shopping and Georgia Tech Egocentric Activities, comprising both static and dynamic camera settings.
{"title":"Coupled Generative Adversarial Network for Continuous Fine-Grained Action Segmentation","authors":"Harshala Gammulle, Tharindu Fernando, S. Denman, S. Sridharan, C. Fookes","doi":"10.1109/WACV.2019.00027","DOIUrl":"https://doi.org/10.1109/WACV.2019.00027","url":null,"abstract":"We propose a novel conditional GAN (cGAN) model for continuous fine-grained human action segmentation, that utilises multi-modal data and learned scene context information. The proposed approach utilises two GANs: termed Action GAN and Auxiliary GAN, where the Action GAN is trained to operate over the current RGB frame while the Auxiliary GAN utilises supplementary information such as depth or optical flow. The goal of both GANs is to generate similar 'action codes', a vector representation of the current action. To facilitate this process a context extractor that incorporates data and recent outputs from both modes is used to extract context information to aids recognition performance. The result is a recurrent GAN architecture which learns a task specific loss function from multiple feature modalities. Extensive evaluations on variants of the proposed model to show the importance of utilising different streams of information such as context and auxiliary information in the proposed network; and show that our model is capable of outperforming state-of-the-art methods for three widely used datasets: 50 Salads, MERL Shopping and Georgia Tech Egocentric Activities, comprising both static and dynamic camera settings.","PeriodicalId":436637,"journal":{"name":"2019 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114549403","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}