CNN-based Driver Activity Understanding: Shedding Light on Deep Spatiotemporal Representations
Alina Roitberg, Monica Haurilet, Simon Reiß, R. Stiefelhagen
2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC), 2020
DOI: 10.1109/ITSC45102.2020.9294731
Citations: 13
Abstract
While deep Convolutional Neural Networks (CNNs) have become front-runners in the field of driver observation, they are often perceived as black boxes due to their end-to-end nature. Interpretability of such models is vital for building trust and is a serious concern for the integration of CNNs into real-life systems. In this paper, we implement a diagnostic framework for analyzing such models internally and shed light on the learned spatiotemporal representations in a comprehensive study. We examine prominent driver monitoring models from three points of view: (1) visually explaining the predictions by combining the gradients with respect to intermediate features with the corresponding activation maps, (2) examining what the network has learned by clustering the internal representations and discovering how individual classes relate at the feature level, and (3) conducting a detailed failure analysis with multiple metrics and evaluation settings (e.g. common versus rare behaviors). Among our findings, we show that most mistakes can be traced back to a learned object- or movement-specific bias, strong semantic similarity between classes (e.g. preparing food and eating), and underrepresentation in the training set. Moreover, we demonstrate the advantages of the Inflated 3D Net over other CNNs, as it yields more discriminative embedding clusters and the highest recognition rates across all metrics.
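The first analysis, visual explanation by combining gradients with respect to intermediate features and the corresponding activation maps, can be approximated with a Grad-CAM-style procedure on a spatiotemporal CNN. The sketch below is a minimal illustration, assuming a PyTorch model (e.g. an I3D-style network) whose chosen convolutional block produces 5D feature maps; the function name, layer choice, and tensor shapes are illustrative assumptions and not the authors' exact implementation.

```python
# Grad-CAM-style visualization sketch for a video CNN (hypothetical names and shapes).
import torch
import torch.nn.functional as F


def gradcam_for_clip(model, clip, target_class, conv_layer):
    """Return a per-frame heatmap: gradient-weighted sum of intermediate activation maps.

    clip:         tensor of shape (1, 3, T, H, W)
    target_class: index of the driver-activity class to explain
    conv_layer:   intermediate module whose activations are visualized
    """
    activations, gradients = [], []

    # Capture the forward activations and the gradients flowing back into them.
    fwd = conv_layer.register_forward_hook(lambda m, i, o: activations.append(o))
    bwd = conv_layer.register_full_backward_hook(lambda m, gi, go: gradients.append(go[0]))

    model.eval()
    logits = model(clip)                      # (1, num_classes)
    model.zero_grad()
    logits[0, target_class].backward()        # gradient of the target class score

    fwd.remove()
    bwd.remove()

    acts, grads = activations[0], gradients[0]          # both (1, C, T', H', W')
    weights = grads.mean(dim=(2, 3, 4), keepdim=True)   # per-channel importance
    cam = F.relu((weights * acts).sum(dim=1))           # (1, T', H', W')

    # Upsample to the input resolution so the map can be overlaid on the video frames.
    cam = F.interpolate(cam.unsqueeze(1), size=clip.shape[2:],
                        mode="trilinear", align_corners=False).squeeze(1)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam  # (1, T, H, W), values in [0, 1]
```

Overlaying the resulting heatmap on the input frames highlights which spatiotemporal regions (e.g. hands, held objects) drove the prediction, which is the kind of evidence used to diagnose object- or movement-specific biases.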