Towards Automatic Prediction of Non-Expert Perceived Speech Fluency Ratings
S. P. Dubagunta, Edoardo Moneta, Eleni Theocharopoulos, Mathew Magimai Doss
Companion Publication of the 2022 International Conference on Multimodal Interaction
Published: 2022-11-07
DOI: https://doi.org/10.1145/3536220.3563689
Abstract
Automatic speech fluency prediction has mainly been approached from the perspective of computer-aided language learning, where the system aims to predict ratings similar to those of human experts. Fluency, however, is also judged in more relaxed social settings, where the ratings usually come from non-experts: everyday assessments of fluency are made by the people we encounter in our social environment, and because globalisation is making these encounters increasingly international, non-expert judgement has become the norm. This paper explores the latter direction, i.e., the prediction of non-expert perceived speech fluency ratings, which, to the best of our knowledge, has not been studied in the speech technology literature. To that end, we investigate three approaches: (a) low-level descriptor feature functionals, (b) a bag-of-audio-words approach, and (c) a neural network based end-to-end acoustic modelling approach. Our investigations on speech data collected from 54 speakers and rated by seven non-experts demonstrate that non-expert speech fluency ratings can be systematically predicted, with the best-performing system yielding a Pearson's correlation coefficient of 0.66 and a Spearman's correlation coefficient of 0.67 with the median human scores.
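The evaluation described above, correlating system predictions with the median of the raters' scores, can be illustrated with a minimal sketch. The rating matrix, the predicted values, and all function names below are hypothetical and chosen only for illustration; they are not the paper's data or code, and the rating scale is an assumption.

```python
from math import sqrt
from statistics import mean, median


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's correlation: Pearson's correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical ratings: one row per speech sample, one column per rater
# (seven non-expert raters, as in the abstract; the scale is assumed).
ratings = [
    [3, 4, 3, 2, 4, 3, 3],
    [5, 5, 4, 5, 4, 5, 5],
    [1, 2, 2, 1, 1, 2, 3],
    [4, 3, 4, 4, 5, 4, 3],
]
# Hypothetical system outputs for the same four samples.
predicted = [3.2, 4.6, 1.9, 3.8]

# The reference score for each sample is the median across raters.
targets = [median(row) for row in ratings]

print("Pearson :", round(pearson(predicted, targets), 3))
print("Spearman:", round(spearman(predicted, targets), 3))
```

Using the median rather than the mean as the reference makes the target less sensitive to a single rater who scores very differently from the others, and reporting Spearman's coefficient alongside Pearson's shows whether the system orders the samples correctly even when the relationship is not strictly linear.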