Human Activity Recognition Utilizing Ensemble of Transfer-Learned Attention Networks and a Low-Cost Convolutional Neural Architecture

Azmain Yakin Srizon, S. Hasan, Md. Farukuzzaman Faruk, Abu Sayeed, Md. Ali Hossain

2022 25th International Conference on Computer and Information Technology (ICCIT), published 2022-12-17
DOI: 10.1109/ICCIT57492.2022.10055456
Citations: 1
Abstract
Over the last few decades, human activity recognition has been considered one of the most complex tasks in computer vision. Many earlier works proposed machine learning models that recognize human actions from sensor-based or video-based data, both of which are costly to collect. Recent advances in convolutional neural networks (CNNs) have opened the possibility of accurate human activity recognition from still images. Although many deep learning-based approaches have been proposed for this problem, the high diversity of human actions has prevented them from achieving consistently strong performance across all action classes under consideration. Some researchers have argued that an ensemble of different models may work better in this regard. However, because the images used for recognition in this domain are mostly captured by security cameras, deep models often fail to extract discriminative features, resulting in misclassifications. To resolve these issues, this study considers three transfer-learned models, namely DenseNet201, Xception, and EfficientNetB6, and applies a multichannel attention module to extract more distinguishable features. In addition, a custom low-cost CNN is proposed that operates on small images and captures features that are often lost in deeper computations. Finally, the features extracted by the attention-based transfer-learned models and the low-cost CNN are fused for the final prediction. The proposed ensemble model was validated on the Stanford 40 Actions, BU-101, and Willow datasets, achieving overall accuracies of 97.48%, 98.29%, and 94.19% respectively, outperforming previous results by notable margins.
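The fusion step described in the abstract can be illustrated with a minimal sketch: pooled feature vectors from the three backbones and the low-cost CNN are concatenated into one vector, which a single classification head then maps to action scores. The feature dimensions, random weights, and class count below are illustrative placeholders, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pooled feature vectors for one image from the four branches
# (dimensions are illustrative assumptions, not from the paper).
feat_densenet = rng.standard_normal(1920)   # DenseNet201 branch
feat_xception = rng.standard_normal(2048)   # Xception branch
feat_effnet   = rng.standard_normal(2304)   # EfficientNetB6 branch
feat_lowcost  = rng.standard_normal(256)    # custom low-cost CNN branch

# Late fusion by concatenation, followed by a linear classification head
# with softmax (weights here are random placeholders, not trained values).
fused = np.concatenate([feat_densenet, feat_xception, feat_effnet, feat_lowcost])
num_classes = 40  # e.g., the Stanford 40 Actions label set
W = rng.standard_normal((num_classes, fused.size)) * 0.01
logits = W @ fused
probs = np.exp(logits - logits.max())
probs /= probs.sum()
predicted_action = int(np.argmax(probs))
```

In a real system each branch would be a trained network with its attention module, and the head would be learned end-to-end; the sketch only shows how concatenation lets complementary features from deep backbones and the shallow low-cost CNN contribute to one prediction.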