{"title":"综合多模态情绪识别的深度神经网络","authors":"Ashutosh Tiwari, Satyam Kumar, Tushar Mehrotra, Rajneesh Kumar Singh","doi":"10.1109/ICDT57929.2023.10150945","DOIUrl":null,"url":null,"abstract":"Emotions may be expressed in many different ways, making automatic affect recognition challenging. Several industries may benefit from this technology, including audiovisual search and human- machine interface. Recently, neural networks have been developed to assess emotional states with unprecedented accuracy. We provide an approach to emotion identification that makes use of both visual and aural signals. It’s crucial to isolate relevant features in order to accurately represent the nuanced emotions conveyed in a wide range of speech patterns. We do this by using a Convolutional Neural Network (CNN) to parse the audio track for feature extraction and a 50-layer deep ResNet to process the visual track. Machine learning algorithms, in addition to needing to extract the characteristics, should also be robust against outliers and reflective of their surroundings. To solve this problem, LSTM networks are used. We train the system from the ground up, using the RECOLA datasets from the AVEC 2016 emotion recognition research challenge, and we demonstrate that our method is superior to prior approaches that relied on manually constructed aural and visual cues for identifying genuine emotional states. It has been demonstrated that the visual modality predicts valence more accurately than arousal. The best results for the valence dimension from the RECOLA dataset are shown in Table III below.","PeriodicalId":266681,"journal":{"name":"2023 International Conference on Disruptive Technologies (ICDT)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Deep Neural Networks for Comprehensive Multimodal Emotion Recognition\",\"authors\":\"Ashutosh Tiwari, Satyam Kumar, Tushar Mehrotra, Rajneesh Kumar Singh\",\"doi\":\"10.1109/ICDT57929.2023.10150945\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Emotions may be expressed in many different ways, making automatic affect recognition challenging. Several industries may benefit from this technology, including audiovisual search and human- machine interface. Recently, neural networks have been developed to assess emotional states with unprecedented accuracy. We provide an approach to emotion identification that makes use of both visual and aural signals. It’s crucial to isolate relevant features in order to accurately represent the nuanced emotions conveyed in a wide range of speech patterns. We do this by using a Convolutional Neural Network (CNN) to parse the audio track for feature extraction and a 50-layer deep ResNet to process the visual track. Machine learning algorithms, in addition to needing to extract the characteristics, should also be robust against outliers and reflective of their surroundings. To solve this problem, LSTM networks are used. We train the system from the ground up, using the RECOLA datasets from the AVEC 2016 emotion recognition research challenge, and we demonstrate that our method is superior to prior approaches that relied on manually constructed aural and visual cues for identifying genuine emotional states. It has been demonstrated that the visual modality predicts valence more accurately than arousal. 
The best results for the valence dimension from the RECOLA dataset are shown in Table III below.\",\"PeriodicalId\":266681,\"journal\":{\"name\":\"2023 International Conference on Disruptive Technologies (ICDT)\",\"volume\":\"51 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-05-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2023 International Conference on Disruptive Technologies (ICDT)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICDT57929.2023.10150945\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 International Conference on Disruptive Technologies (ICDT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDT57929.2023.10150945","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Deep Neural Networks for Comprehensive Multimodal Emotion Recognition
Emotions can be expressed in many different ways, which makes automatic affect recognition challenging. Several application areas stand to benefit from this technology, including audiovisual search and human-machine interfaces. Recently, neural networks have been used to assess emotional states with unprecedented accuracy. We present an approach to emotion recognition that uses both visual and aural signals. Isolating relevant features is crucial for accurately representing the nuanced emotions conveyed across a wide range of speech patterns. We do this by using a Convolutional Neural Network (CNN) to extract features from the audio track and a 50-layer deep ResNet to process the visual track. Beyond extracting features, the model should also be robust to outliers and able to account for the temporal context of each observation; to address this, LSTM networks are used. We train the system from scratch on the RECOLA dataset from the AVEC 2016 emotion recognition research challenge and show that our method outperforms prior approaches that relied on hand-crafted aural and visual features for identifying genuine emotional states. The visual modality is shown to predict valence more accurately than arousal. The best results for the valence dimension on the RECOLA dataset are shown in Table III.
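To make the described pipeline concrete, below is a minimal PyTorch sketch of this kind of architecture. It is not the authors' implementation: the audio-CNN layer sizes, per-frame audio chunk length, concatenation-based fusion, and single-layer LSTM are illustrative assumptions; only the overall structure (ResNet-50 visual branch, CNN audio branch, LSTM over fused per-frame features, two-dimensional valence/arousal output) follows the abstract.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50


class AudioCNN(nn.Module):
    """Small 1D CNN over per-frame raw-audio chunks (hypothetical layer sizes)."""
    def __init__(self, out_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=80, stride=4), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),             # collapse the time axis -> (B, 64, 1)
            nn.Flatten(),
            nn.Linear(64, out_dim), nn.ReLU(),
        )

    def forward(self, x):                        # x: (B, 1, samples_per_frame)
        return self.net(x)                       # -> (B, out_dim)


class MultimodalEmotionNet(nn.Module):
    """ResNet-50 visual branch + CNN audio branch, fused and fed to an LSTM regressor."""
    def __init__(self, hidden=256):
        super().__init__()
        self.visual = resnet50(weights=None)     # 50-layer ResNet, as in the paper
        self.visual.fc = nn.Identity()           # keep the 2048-d pooled features
        self.audio = AudioCNN(out_dim=256)
        self.lstm = nn.LSTM(2048 + 256, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)         # per-step valence and arousal

    def forward(self, frames, audio):
        # frames: (B, T, 3, 224, 224); audio: (B, T, samples_per_frame)
        B, T = frames.shape[:2]
        v = self.visual(frames.flatten(0, 1)).reshape(B, T, -1)            # (B, T, 2048)
        a = self.audio(audio.flatten(0, 1).unsqueeze(1)).reshape(B, T, -1) # (B, T, 256)
        seq, _ = self.lstm(torch.cat([v, a], dim=-1))  # LSTM adds temporal context
        return self.head(seq)                          # (B, T, 2)


if __name__ == "__main__":
    model = MultimodalEmotionNet()
    frames = torch.randn(2, 8, 3, 224, 224)      # 2 clips, 8 video frames each
    audio = torch.randn(2, 8, 1600)              # matching per-frame audio chunks
    print(model(frames, audio).shape)            # torch.Size([2, 8, 2])
```

Training details such as the loss (e.g., a concordance-correlation-based objective is common on RECOLA) and the frame/audio alignment are not specified by the abstract and are left out of the sketch.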