{"title":"基于二维和三维卷积神经网络的RGB-D手势识别的比较研究","authors":"M. Kurmanji, F. Ghaderi","doi":"10.22044/JADM.2019.7903.1929","DOIUrl":null,"url":null,"abstract":"Despite considerable enhances in recognizing hand gestures from still images, there are still many challenges in the classification of hand gestures in videos. The latter comes with more challenges, including higher computational complexity and arduous task of representing temporal features. Hand movement dynamics, represented by temporal features, have to be extracted by analyzing the total frames of a video. So far, both 2D and 3D convolutional neural networks have been used to manipulate the temporal dynamics of the video frames. 3D CNNs can extract the changes in the consecutive frames and tend to be more suitable for the video classification task, however, they usually need more time. On the other hand, by using techniques like tiling it is possible to aggregate all the frames in a single matrix and preserve the temporal and spatial features. This way, using 2D CNNs, which are inherently simpler than 3D CNNs can be used to classify the video instances. In this paper, we compared the application of 2D and 3D CNNs for representing temporal features and classifying hand gesture sequences. Additionally, providing a two-stage two-stream architecture, we efficiently combined color and depth modalities and 2D and 3D CNN predictions. The effect of different types of augmentation techniques is also investigated. Our results confirm that appropriate usage of 2D CNNs outperforms a 3D CNN implementation in this task.","PeriodicalId":32592,"journal":{"name":"Journal of Artificial Intelligence and Data Mining","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2020-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"Hand Gesture Recognition from RGB-D Data using 2D and 3D Convolutional Neural Networks: a comparative study\",\"authors\":\"M. Kurmanji, F. Ghaderi\",\"doi\":\"10.22044/JADM.2019.7903.1929\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Despite considerable enhances in recognizing hand gestures from still images, there are still many challenges in the classification of hand gestures in videos. The latter comes with more challenges, including higher computational complexity and arduous task of representing temporal features. Hand movement dynamics, represented by temporal features, have to be extracted by analyzing the total frames of a video. So far, both 2D and 3D convolutional neural networks have been used to manipulate the temporal dynamics of the video frames. 3D CNNs can extract the changes in the consecutive frames and tend to be more suitable for the video classification task, however, they usually need more time. On the other hand, by using techniques like tiling it is possible to aggregate all the frames in a single matrix and preserve the temporal and spatial features. This way, using 2D CNNs, which are inherently simpler than 3D CNNs can be used to classify the video instances. In this paper, we compared the application of 2D and 3D CNNs for representing temporal features and classifying hand gesture sequences. Additionally, providing a two-stage two-stream architecture, we efficiently combined color and depth modalities and 2D and 3D CNN predictions. The effect of different types of augmentation techniques is also investigated. Our results confirm that appropriate usage of 2D CNNs outperforms a 3D CNN implementation in this task.\",\"PeriodicalId\":32592,\"journal\":{\"name\":\"Journal of Artificial Intelligence and Data Mining\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-04-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Artificial Intelligence and Data Mining\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.22044/JADM.2019.7903.1929\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Artificial Intelligence and Data Mining","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.22044/JADM.2019.7903.1929","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Hand Gesture Recognition from RGB-D Data using 2D and 3D Convolutional Neural Networks: a comparative study
Despite considerable enhances in recognizing hand gestures from still images, there are still many challenges in the classification of hand gestures in videos. The latter comes with more challenges, including higher computational complexity and arduous task of representing temporal features. Hand movement dynamics, represented by temporal features, have to be extracted by analyzing the total frames of a video. So far, both 2D and 3D convolutional neural networks have been used to manipulate the temporal dynamics of the video frames. 3D CNNs can extract the changes in the consecutive frames and tend to be more suitable for the video classification task, however, they usually need more time. On the other hand, by using techniques like tiling it is possible to aggregate all the frames in a single matrix and preserve the temporal and spatial features. This way, using 2D CNNs, which are inherently simpler than 3D CNNs can be used to classify the video instances. In this paper, we compared the application of 2D and 3D CNNs for representing temporal features and classifying hand gesture sequences. Additionally, providing a two-stage two-stream architecture, we efficiently combined color and depth modalities and 2D and 3D CNN predictions. The effect of different types of augmentation techniques is also investigated. Our results confirm that appropriate usage of 2D CNNs outperforms a 3D CNN implementation in this task.