Estimation of human energy expenditure in sports and exercise contributes to performance analyses and tracking of physical activity levels. The focus of this work is to develop a video-based method for estimation of energy expenditure in athletes. We propose a method using thermal video analysis to automatically extract the cyclic motion pattern, which in walking and running corresponds to steps, and analyse its frequency. Experiments are performed with one subject in two different tests, each at 5, 8, 10, and 12 km/h. The results of our proposed video-based method are compared to concurrent measurements of oxygen uptake. These initial experiments indicate a correlation between estimated step frequency and oxygen uptake. Based on the preliminary results, we conclude that the proposed method has potential as a future non-invasive approach to estimate energy expenditure during sports.
{"title":"Measuring Energy Expenditure in Sports by Thermal Video Analysis","authors":"Rikke Gade, R. Larsen, T. Moeslund","doi":"10.1109/CVPRW.2017.29","DOIUrl":"https://doi.org/10.1109/CVPRW.2017.29","url":null,"abstract":"Estimation of human energy expenditure in sports and exercise contributes to performance analyses and tracking of physical activity levels. The focus of this work is to develop a video-based method for estimation of energy expenditure in athletes. We propose a method using thermal video analysis to automatically extract the cyclic motion pattern, in walking and running represented as steps, and analyse the frequency. Experiments are performed with one subject in two different tests, each at 5, 8, 10, and 12 km/h. The results of our proposed video-based method is compared to concurrent measurements of oxygen uptake. These initial experiments indicate a correlation between estimated step frequency and oxygen uptake. Based on the preliminary results we conclude that the proposed method has potential as a future non-invasive approach to estimate energy expenditure during sports.","PeriodicalId":6668,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"19 1","pages":"187-194"},"PeriodicalIF":0.0,"publicationDate":"2017-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88407443","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
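The frequency analysis mentioned in the thermal-video abstract above can be illustrated with a minimal sketch. It assumes a one-dimensional per-frame motion signal has already been extracted from the video; the abstract does not say how the authors compute the frequency, so the discrete Fourier transform used here, the function name `dominant_frequency`, and the synthetic 2.5 Hz input are all illustrative assumptions rather than the paper's method.

```python
import cmath
import math


def dominant_frequency(signal, sample_rate_hz):
    """Return the frequency (Hz) of the strongest non-DC component of `signal`."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]  # remove the constant offset
    best_bin, best_power = 0, 0.0
    # Naive O(n^2) DFT over the positive-frequency bins only.
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        power = abs(coeff) ** 2
        if power > best_power:
            best_bin, best_power = k, power
    # Convert the winning bin index to a frequency in Hz.
    return best_bin * sample_rate_hz / n


# Hypothetical input: 10 s of a motion signal sampled at 30 frames/s,
# synthesized as a 2.5 Hz sinusoid (not data from the paper).
fps = 30
frames = [math.sin(2 * math.pi * 2.5 * t / fps) for t in range(10 * fps)]
step_hz = dominant_frequency(frames, fps)
steps_per_minute = step_hz * 60
```

With a 10 s window the frequency resolution is 0.1 Hz, so longer windows would be needed to separate step rates that lie closer together than that.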
In this paper, we propose a novel method for displaying 3D images based on a 5D light field representation. In our method, the light fields emitted by a light field projector are projected into 3D scattering media such as fog. The intensity of light rays projected into the scattering media decreases because of the scattering effect of the media. As a result, 5D light fields are generated in the scattering media. The proposed method models the relationship between the 5D light fields and observed images, and uses this relationship to project light fields so that the observed image changes according to the viewpoint of the observer. In order to achieve accurate and efficient 3D image representation, we describe the relationship not with a parametric model, but with an observation-based model obtained from a point spread function (PSF) of the scattering media. The experimental results show the efficiency of the proposed method.
{"title":"Generating 5D Light Fields in Scattering Media for Representing 3D Images","authors":"E. Yuasa, Fumihiko Sakaue, J. Sato","doi":"10.1109/CVPRW.2017.169","DOIUrl":"https://doi.org/10.1109/CVPRW.2017.169","url":null,"abstract":"In this paper, we propose a novel method for displaying 3D images based on a 5D light field representation. In our method, the light fields emitted by a light field projector are projected into 3D scattering media such as fog. The intensity of light lays projected into the scattering media decreases because of the scattering effect of the media. As a result, 5D light fields are generated in the scattering media. The proposed method models the relationship between the 5D light fields and observed images, and uses the relationship for projecting light fields so that the observed image changes according to the viewpoint of observers. In order to achieve accurate and efficient 3D image representation, we describe the relationship not by using a parametric model, but by using an observation based model obtained from a point spread function (PSF) of scattering media. The experimental results show the efficiency of the proposed method.","PeriodicalId":6668,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"18 1","pages":"1287-1294"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73559430","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The use of surveillance cameras continues to increase, ranging from conventional applications such as law enforcement to newer scenarios with looser requirements such as gathering business intelligence. Humans still play an integral part in using and interpreting the footage from these systems, but are also a significant factor in causing unintentional privacy breaches. As computer vision methods continue to improve, we argue in this position paper that system designers should reconsider the role of machines in surveillance, and how automation can be used to help protect privacy. We explore this by discussing the impact of the human-in-the-loop, the potential for using abstraction and distributed computing to further privacy goals, and an approach for determining when video footage should be hidden from human users. We propose that in an ideal surveillance scenario, a privacy-affirming framework causes collected camera footage to be processed by computers directly, and never shown to humans. This implicitly requires humans to establish trust, to believe that computer vision systems can generate sufficiently accurate results without human supervision, so that if information about people must be gathered, unintentional data collection is mitigated as much as possible.
{"title":"Trusting the Computer in Computer Vision: A Privacy-Affirming Framework","authors":"A. Chen, M. Biglari-Abhari, K. Wang","doi":"10.1109/CVPRW.2017.178","DOIUrl":"https://doi.org/10.1109/CVPRW.2017.178","url":null,"abstract":"The use of surveillance cameras continues to increase, ranging from conventional applications such as law enforcement to newer scenarios with looser requirements such as gathering business intelligence. Humans still play an integral part in using and interpreting the footage from these systems, but are also a significant factor in causing unintentional privacy breaches. As computer vision methods continue to improve, we argue in this position paper that system designers should reconsider the role of machines in surveillance, and how automation can be used to help protect privacy. We explore this by discussing the impact of the human-in-the-loop, the potential for using abstraction and distributed computing to further privacy goals, and an approach for determining when video footage should be hidden from human users. We propose that in an ideal surveillance scenario, a privacy-affirming framework causes collected camera footage to be processed by computers directly, and never shown to humans. 
This implicitly requires humans to establish trust, to believe that computer vision systems can generate sufficiently accurate results without human supervision, so that if information about people must be gathered, unintentional data collection is mitigated as much as possible.","PeriodicalId":6668,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"65 1","pages":"1360-1367"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74401892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We propose a model for full body and face de-identification of humans in images. Given a segmentation of the human figure, our model generates a synthetic human image with an alternative appearance that looks natural and fits the segmentation outline. The model is usable with various levels of segmentation, from simple human figure blobs to complex garment-level segmentations. The level of detail in the de-identified output depends on the level of detail in the input segmentation. The model de-identifies not only primary biometric identifiers (e.g. the face), but also soft and non-biometric identifiers including clothing, hairstyle, etc. Quantitative and perceptual experiments indicate that our model produces de-identified outputs that thwart human and machine recognition, while preserving data utility and naturalness.
{"title":"I Know That Person: Generative Full Body and Face De-identification of People in Images","authors":"K. Brkić, I. Sikirić, T. Hrkać, Z. Kalafatić","doi":"10.1109/CVPRW.2017.173","DOIUrl":"https://doi.org/10.1109/CVPRW.2017.173","url":null,"abstract":"We propose a model for full body and face deidentification of humans in images. Given a segmentation of the human figure, our model generates a synthetic human image with an alternative appearance that looks natural and fits the segmentation outline. The model is usable with various levels of segmentation, from simple human figure blobs to complex garment-level segmentations. The level of detail in the de-identified output depends on the level of detail in the input segmentation. The model de-identifies not only primary biometric identifiers (e.g. the face), but also soft and non-biometric identifiers including clothing, hairstyle, etc. Quantitative and perceptual experiments indicate that our model produces de-identified outputs that thwart human and machine recognition, while preserving data utility and naturalness.","PeriodicalId":6668,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"10 1","pages":"1319-1328"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82057116","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Understanding human hand usage is one of the richest sources of information for recognizing human manipulation actions. Since humans use various tools during actions, grasp recognition gives important cues for figuring out humans' intentions and tasks. Earlier studies analyzed grasps from the positions of hand joints measured by attached sensors, but since these types of sensors prevent humans from performing actions naturally, visual approaches have become the focus in recent years. Convolutional neural networks require a vast annotated dataset, but, to our knowledge, no human grasping dataset includes ground truth for hand regions. In this paper, we propose a grasp recognition method that uses only image-level labels within a weakly supervised learning framework. In addition, we split the grasp recognition process into two stages, hand localization and grasp classification, to speed up processing. Experimental results demonstrate that the proposed method outperforms existing methods and can run in real time.
{"title":"Real-Time Hand Grasp Recognition Using Weakly Supervised Two-Stage Convolutional Neural Networks for Understanding Manipulation Actions","authors":"Ji Woong Kim, Sujeong You, S. Ji, Hong-Seok Kim","doi":"10.1109/CVPRW.2017.67","DOIUrl":"https://doi.org/10.1109/CVPRW.2017.67","url":null,"abstract":"Understanding human hand usage is one of the richest information source to recognize human manipulation actions. Since humans use various tools during actions, grasp recognition gives important cues to figure out humans' intention and tasks. Earlier studies analyzed grasps with positions of hand joints by attaching sensors, but since these types of sensors prevent humans from naturally conducting actions, visual approaches have been focused in recent years. Convolutional neural networks require a vast annotated dataset, but, to our knowledge, no human grasping dataset includes ground truth of hand regions. In this paper, we propose a grasp recognition method only with image-level labels by the weakly supervised learning framework. In addition, we split the grasp recognition process into two stages that are hand localization and grasp classification so as to speed up. Experimental results demonstrate that the proposed method outperforms existing methods and can perform in real-time.","PeriodicalId":6668,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"41 1","pages":"481-483"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88110755","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Andrea Zunino, Jacopo Cavazza, A. Koul, A. Cavallo, C. Becchio, Vittorio Murino
In computer vision, video-based approaches have been widely explored for the early classification and the prediction of actions or activities. However, it remains unclear whether this modality (as compared to 3D kinematics) can still be reliable for the prediction of human intentions, defined as the overarching goal embedded in an action sequence. Since the same action can be performed with different intentions, this problem is more challenging, yet tractable, as shown by quantitative cognitive studies which exploit the 3D kinematics acquired through motion capture systems. In this paper, we bridge cognitive and computer vision studies by demonstrating the effectiveness of video-based approaches for the prediction of human intentions. Precisely, we propose Intention from Motion, a new paradigm where, without using any contextual information, we consider instantaneous grasping motor acts involving a bottle in order to forecast why the bottle itself has been reached (to pass it, to place it in a box, to pour, or to drink the liquid inside). We process only the grasping onsets, casting intention prediction as a classification problem. Leveraging our multimodal acquisition (3D motion capture data and 2D optical videos), we compare the most commonly used 3D descriptors from cognitive studies with state-of-the-art video-based techniques. Since the two analyses achieve equivalent performance, we demonstrate that computer vision tools are effective in capturing the kinematics and addressing the cognitive problem of human intention prediction.
{"title":"What Will I Do Next? The Intention from Motion Experiment","authors":"Andrea Zunino, Jacopo Cavazza, A. Koul, A. Cavallo, C. Becchio, Vittorio Murino","doi":"10.1109/CVPRW.2017.7","DOIUrl":"https://doi.org/10.1109/CVPRW.2017.7","url":null,"abstract":"In computer vision, video-based approaches have been widely explored for the early classification and the prediction of actions or activities. However, it remains unclear whether this modality (as compared to 3D kinematics) can still be reliable for the prediction of human intentions, defined as the overarching goal embedded in an action sequence. Since the same action can be performed with different intentions, this problem is more challenging but yet affordable as proved by quantitative cognitive studies which exploit the 3D kinematics acquired through motion capture systems.In this paper, we bridge cognitive and computer vision studies, by demonstrating the effectiveness of video-based approaches for the prediction of human intentions. Precisely, we propose Intention from Motion, a new paradigm where, without using any contextual information, we consider instantaneous grasping motor acts involving a bottle in order to forecast why the bottle itself has been reached (to pass it or to place in a box, or to pour or to drink the liquid inside).We process only the grasping onsets casting intention prediction as a classification framework. Leveraging on our multimodal acquisition (3D motion capture data and 2D optical videos), we compare the most commonly used 3D descriptors from cognitive studies with state-of-the-art video-based techniques. 
Since the two analyses achieve an equivalent performance, we demonstrate that computer vision tools are effective in capturing the kinematics and facing the cognitive problem of human intention prediction.","PeriodicalId":6668,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"19 1","pages":"1-8"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79219674","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yamin Han, Peng Zhang, Tao Zhuo, Wei Huang, Yanning Zhang
Strategies based on deep convolution networks have shown remarkable performance in different recognition tasks. Unfortunately, in a variety of realistic scenarios, accurate and robust recognition remains hard, especially for videos. Challenges such as cluttered backgrounds or viewpoint changes may cause large intrinsic and extrinsic class variations. In addition, data deficiency can degrade the designed model during learning and updating. Therefore, incorporating frame-wise motion into the learning model on the fly has become increasingly attractive in contemporary video analysis studies. To overcome those limitations, we propose an approach based on deeper convolution networks with pairwise motion concatenation, named deep temporal convolutional networks. A temporal motion accumulation mechanism is introduced as an effective data entry for the learning of the convolution networks. Specifically, to handle possible data deficiency, the beneficial practices of transferring ResNet-101 weights and data variation augmentation are also used for robust recognition. Experiments on the challenging UCF101 and ODAR datasets show favourable performance compared with other state-of-the-art works.
{"title":"Video Action Recognition Based on Deeper Convolution Networks with Pair-Wise Frame Motion Concatenation","authors":"Yamin Han, Peng Zhang, Tao Zhuo, Wei Huang, Yanning Zhang","doi":"10.1109/CVPRW.2017.162","DOIUrl":"https://doi.org/10.1109/CVPRW.2017.162","url":null,"abstract":"Deep convolution networks based strategies have shown a remarkable performance in different recognition tasks. Unfortunately, in a variety of realistic scenarios, accurate and robust recognition is hard especially for the videos. Different challenges such as cluttered backgrounds or viewpoint change etc. may generate the problem like large intrinsic and extrinsic class variations. In addition, the problem of data deficiency could also make the designed model degrade during learning and update. Therefore, an effective way by incorporating the frame-wise motion into the learning model on-the-fly has become more and more attractive in contemporary video analysis studies.,,,,,,To overcome those limitations, in this work, we proposed a deeper convolution networks based approach with pairwise motion concatenation, which is named deep temporal convolutional networks. In this work, a temporal motion accumulation mechanism has been introduced as an effective data entry for the learning of convolution networks. Specifically, to handle the possible data deficiency, beneficial practices of transferring ResNet-101 weights and data variation augmentation are also utilized for the purpose of robust recognition. 
Experiments on challenging dataset UCF101 and ODAR dataset have verified a preferable performance when compared with other state-of-art works.","PeriodicalId":6668,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"64 1","pages":"1226-1235"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91235967","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Misha Sra, Prashanth Vijayaraghavan, Ognjen Rudovic, P. Maes, D. Roy
Affective virtual spaces are of interest for many VR applications in areas of wellbeing, art, education, and entertainment. Creating content for virtual environments is a laborious task involving multiple skills like 3D modeling, texturing, animation, lighting, and programming. One way to facilitate content creation is to automate sub-processes like assignment of textures and materials within virtual environments. To this end, we introduce the DeepSpace approach that automatically creates and applies image textures to objects in procedurally created 3D scenes. The main novelty of our DeepSpace approach is that it uses music to automatically create kaleidoscopic textures for virtual environments designed to elicit emotional responses in users. Specifically, DeepSpace exploits the modeling power of deep neural networks, which have shown great performance in image generation tasks, to achieve mood-based image generation. Our study results indicate the virtual environments created by DeepSpace elicit positive emotions and achieve high presence scores.
{"title":"DeepSpace: Mood-Based Image Texture Generation for Virtual Reality from Music","authors":"Misha Sra, Prashanth Vijayaraghavan, Ognjen Rudovic, P. Maes, D. Roy","doi":"10.1109/CVPRW.2017.283","DOIUrl":"https://doi.org/10.1109/CVPRW.2017.283","url":null,"abstract":"Affective virtual spaces are of interest for many VR applications in areas of wellbeing, art, education, and entertainment. Creating content for virtual environments is a laborious task involving multiple skills like 3D modeling, texturing, animation, lighting, and programming. One way to facilitate content creation is to automate sub-processes like assignment of textures and materials within virtual environments. To this end, we introduce the DeepSpace approach that automatically creates and applies image textures to objects in procedurally created 3D scenes. The main novelty of our DeepSpace approach is that it uses music to automatically create kaleidoscopic textures for virtual environments designed to elicit emotional responses in users. Specifically, DeepSpace exploits the modeling power of deep neural networks, which have shown great performance in image generation tasks, to achieve mood-based image generation. Our study results indicate the virtual environments created by DeepSpace elicit positive emotions and achieve high presence scores.","PeriodicalId":6668,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"33 1","pages":"2289-2298"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88457622","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jianshu Li, Yunpeng Chen, Shengtao Xiao, Jian Zhao, S. Roy, Jiashi Feng, Shuicheng Yan, T. Sim
This paper presents the proposed solution to the "affect in the wild" challenge, which aims to estimate the affective level, i.e. the valence and arousal values, of every frame in a video. A carefully designed deep convolutional neural network (a variation of residual network) for affective level estimation of facial expressions is first implemented as a baseline. Next, we use multiple memory networks to model the temporal relations between the frames. Finally, ensemble models are used to combine the predictions from multiple memory networks. Our proposed solution outperforms the baseline model by 10.62% in terms of mean square error (MSE).
{"title":"Estimation of Affective Level in the Wild with Multiple Memory Networks","authors":"Jianshu Li, Yunpeng Chen, Shengtao Xiao, Jian Zhao, S. Roy, Jiashi Feng, Shuicheng Yan, T. Sim","doi":"10.1109/CVPRW.2017.244","DOIUrl":"https://doi.org/10.1109/CVPRW.2017.244","url":null,"abstract":"This paper presents the proposed solution to the \"affect in the wild\" challenge, which aims to estimate the affective level, i.e. the valence and arousal values, of every frame in a video. A carefully designed deep convolutional neural network (a variation of residual network) for affective level estimation of facial expressions is first implemented as a baseline. Next we use multiple memory networks to model the temporal relations between the frames. Finally ensemble models are used to combine the predictions from multiple memory networks. Our proposed solution outperforms the baseline model by a factor of 10.62% in terms of mean square error (MSE).","PeriodicalId":6668,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"3 1","pages":"1947-1954"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78432413","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
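The MSE comparison reported in the affective-level abstract above can be made concrete with a small worked example. The abstract does not define how the 10.62% figure is computed; the sketch below assumes it is the relative reduction in MSE with respect to the baseline, and every number in it is made up for illustration, not taken from the paper.

```python
def mean_squared_error(predictions, targets):
    """Average of squared differences between predictions and targets."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def relative_mse_reduction(baseline_mse, model_mse):
    """Percentage by which `model_mse` is lower than `baseline_mse` (assumed definition)."""
    return (baseline_mse - model_mse) / baseline_mse * 100.0


# Illustrative per-frame valence values (hypothetical, not from the challenge data).
targets = [0.0, 0.5, -0.5, 1.0]
baseline_predictions = [0.2, 0.3, -0.1, 0.6]
model_predictions = [0.1, 0.4, -0.3, 0.8]

baseline_mse = mean_squared_error(baseline_predictions, targets)
model_mse = mean_squared_error(model_predictions, targets)
reduction_percent = relative_mse_reduction(baseline_mse, model_mse)
```

Under this definition a positive value means the model's error is lower than the baseline's, and a value of 10.62 would mean the model's MSE is 89.38% of the baseline's.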
Janice Pan, Vikram V. Appia, Jesse Villarreal, Lucas Weaver, Do-Kyoung Kwon
Automobiles are currently equipped with a three-mirror system for rear-view visualization. The two side-view mirrors show the close periphery on the left and right sides of the vehicle, and the center rear-view mirror is typically adjusted to allow the driver to see through the vehicle's rear windshield. This three-mirror system, however, raises safety concerns because it requires drivers to shift their attention and gaze to each mirror to obtain a full view of the rear surroundings, which takes attention off the scene in front of the vehicle. We present an alternative to the three-mirror rear-view system, which we call Rear-Stitched View Panorama (RSVP). The proposed system uses four rear-facing cameras, strategically placed to overcome the traditional blind spot problem, and stitches the feeds from each camera together to generate a single panoramic view, which can display the entire rear surroundings. We project individually captured frames onto a single virtual view using precomputed system calibration parameters. Then we determine optimal seam lines, along which the images are fused together to form the single RSVP view presented to the driver. Furthermore, we highlight techniques that enable efficient embedded implementation of the system and showcase a real-time system utilizing under 2W of power, making it suitable for in-cabin deployment in vehicles.
{"title":"Rear-Stitched View Panorama: A Low-Power Embedded Implementation for Smart Rear-View Mirrors on Vehicles","authors":"Janice Pan, Vikram V. Appia, Jesse Villarreal, Lucas Weaver, Do-Kyoung Kwon","doi":"10.1109/CVPRW.2017.157","DOIUrl":"https://doi.org/10.1109/CVPRW.2017.157","url":null,"abstract":"Automobiles are currently equipped with a three-mirror system for rear-view visualization. The two side-view mirrors show close the periphery on the left and right sides of the vehicle, and the center rear-view mirror is typically adjusted to allow the driver to see through the vehicle's rear windshield. This three-mirror system, however, imposes safety concerns in requiring drivers to shift their attention and gaze to look in each mirror to obtain a full visualization of the rear-view surroundings, which takes attention off the scene in front of the vehicle. We present an alternative to the three-mirror rear-view system, which we call Rear-Stitched View Panorama (RSVP). The proposed system uses four rear-facing cameras, strategically placed to overcome the traditional blind spot problem, and stitches the feeds from each camera together to generate a single panoramic view, which can display the entire rear surroundings. We project individually captured frames onto a single virtual view using precomputed system calibration parameters. Then we determine optimal seam lines, along which the images are fused together to form the single RSVP view presented to the driver. 
Furthermore, we highlight techniques that enable efficient embedded implementation of the system and showcase a real-time system utilizing under 2W of power, making it suitable for in-cabin deployment in vehicles.","PeriodicalId":6668,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"32 1","pages":"1184-1193"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81517148","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
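The seam-line step in the RSVP abstract above can be sketched as a minimum-cost path search through the overlap between two adjacent views. The abstract does not state how the authors choose their seams, so the dynamic-programming formulation below is a commonly used stand-in, not the paper's algorithm. The function name `find_vertical_seam` and the cost grid are hypothetical; each cost is assumed to measure how much the two images disagree at that pixel.

```python
def find_vertical_seam(cost):
    """Return one column index per row tracing the cheapest top-to-bottom seam.

    `cost` is a list of equal-length rows; the seam may move at most one
    column left or right between consecutive rows.
    """
    rows, cols = len(cost), len(cost[0])
    # total[r][c] = cheapest cumulative cost of any seam ending at (r, c).
    total = [list(cost[0])]
    for r in range(1, rows):
        prev = total[-1]
        row = []
        for c in range(cols):
            lo, hi = max(0, c - 1), min(cols, c + 2)
            row.append(cost[r][c] + min(prev[lo:hi]))
        total.append(row)
    # Backtrack from the cheapest cell in the last row.
    c = min(range(cols), key=total[-1].__getitem__)
    seam = [c]
    for r in range(rows - 1, 0, -1):
        lo, hi = max(0, c - 1), min(cols, c + 2)
        c = min(range(lo, hi), key=total[r - 1].__getitem__)
        seam.append(c)
    seam.reverse()
    return seam


# Hypothetical 3x3 disagreement map for an overlap region.
overlap_cost = [
    [9, 1, 9],
    [9, 9, 1],
    [9, 1, 9],
]
seam = find_vertical_seam(overlap_cost)
```

Pixels left of the seam would then be taken from one camera and pixels right of it from the other, which keeps the transition in regions where the two views already agree.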