Yue Gu, Xinyu Li, Shuhong Chen, Jianyu Zhang, Ivan Marsic
{"title":"多模式深度学习的言语意图分类。","authors":"Yue Gu, Xinyu Li, Shuhong Chen, Jianyu Zhang, Ivan Marsic","doi":"10.1007/978-3-319-57351-9_30","DOIUrl":null,"url":null,"abstract":"<p><p>We present a novel multimodal deep learning structure that automatically extracts features from textual-acoustic data for sentence-level speech classification. Textual and acoustic features were first extracted using two independent convolutional neural network structures, then combined into a joint representation, and finally fed into a decision softmax layer. We tested the proposed model in an actual medical setting, using speech recording and its transcribed log. Our model achieved 83.10% average accuracy in detecting 6 different intentions. We also found that our model using automatically extracted features for intention classification outperformed existing models that use manufactured features.</p>","PeriodicalId":91830,"journal":{"name":"Advances in artificial intelligence. Canadian Society for Computational Studies of Intelligence. Conference","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6261374/pdf/nihms-993283.pdf","citationCount":"0","resultStr":"{\"title\":\"Speech Intention Classification with Multimodal Deep Learning.\",\"authors\":\"Yue Gu, Xinyu Li, Shuhong Chen, Jianyu Zhang, Ivan Marsic\",\"doi\":\"10.1007/978-3-319-57351-9_30\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>We present a novel multimodal deep learning structure that automatically extracts features from textual-acoustic data for sentence-level speech classification. Textual and acoustic features were first extracted using two independent convolutional neural network structures, then combined into a joint representation, and finally fed into a decision softmax layer. 
We tested the proposed model in an actual medical setting, using speech recording and its transcribed log. Our model achieved 83.10% average accuracy in detecting 6 different intentions. We also found that our model using automatically extracted features for intention classification outperformed existing models that use manufactured features.</p>\",\"PeriodicalId\":91830,\"journal\":{\"name\":\"Advances in artificial intelligence. Canadian Society for Computational Studies of Intelligence. Conference\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-05-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6261374/pdf/nihms-993283.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Advances in artificial intelligence. Canadian Society for Computational Studies of Intelligence. Conference\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1007/978-3-319-57351-9_30\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2017/4/11 0:00:00\",\"PubModel\":\"Epub\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Advances in artificial intelligence. Canadian Society for Computational Studies of Intelligence. Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1007/978-3-319-57351-9_30","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2017/4/11 0:00:00","PubModel":"Epub","JCR":"","JCRName":"","Score":null,"Total":0}
Speech Intention Classification with Multimodal Deep Learning.
We present a novel multimodal deep learning architecture that automatically extracts features from textual-acoustic data for sentence-level speech classification. Textual and acoustic features are first extracted by two independent convolutional neural networks, then combined into a joint representation, and finally fed into a softmax decision layer. We tested the proposed model in an actual medical setting, using speech recordings and their transcribed logs. Our model achieved 83.10% average accuracy in detecting 6 different intentions. We also found that our model, which uses automatically extracted features for intention classification, outperformed existing models that rely on handcrafted features.
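The fusion scheme the abstract describes, two independent feature extractors whose outputs are concatenated into a joint representation and passed to a softmax layer over the 6 intention classes, can be sketched roughly as below. This is a minimal NumPy illustration, not the paper's implementation: the single linear-plus-ReLU stand-ins replace the actual convolutional branches, and all dimensions (100-d text input, 40-d acoustic input, 32-d per-branch features) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    # Numerically stable softmax over the last axis.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def extract_features(x, W):
    # Stand-in for one branch: a single linear layer + ReLU.
    # In the paper each branch is a convolutional neural network.
    return np.maximum(x @ W, 0.0)

# Hypothetical dimensions: 100-d text input, 40-d acoustic input,
# 32-d features per branch, 6 intention classes.
W_text = rng.standard_normal((100, 32)) * 0.1
W_audio = rng.standard_normal((40, 32)) * 0.1
W_out = rng.standard_normal((64, 6)) * 0.1

def classify(text_x, audio_x):
    # Concatenate the two branch outputs into a joint representation,
    # then apply the softmax decision layer.
    joint = np.concatenate([extract_features(text_x, W_text),
                            extract_features(audio_x, W_audio)])
    return softmax(joint @ W_out)

probs = classify(rng.standard_normal(100), rng.standard_normal(40))
print(probs.shape)            # (6,)
print(round(probs.sum(), 6))  # 1.0
```

In a trained system, the weights would of course be learned jointly end-to-end rather than sampled at random; the sketch only shows the data flow of the late-concatenation fusion.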