A. Chotimongkol, Vataya Chunwijitra, Sumonmas Thatphithakkul, Nattapong Kurpukdee, C. Wutiwiwatchai
Title: Elicit spoken-style data from social media through a style classifier
Published in: 2015 International Conference Oriental COCOSDA held jointly with 2015 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE), 2015-12-17
DOI: https://doi.org/10.1109/ICSDA.2015.7357856
Citations: 5
Abstract
We explore the use of social media data to reduce the effort of developing a conversational speech corpus. The LOTUS-SOC corpus was created by having speakers read Twitter messages aloud and recording them through a mobile application. In the first phase, which took around one month, 172 hours of speech from 208 speakers were recorded and were ready for use without the need for speech segmentation or transcription. In terms of similarity to spoken language, the perplexity of LOTUS-SOC with respect to known spoken utterances is lower than that of the broadcast news corpus and almost as low as that of the telephone conversation corpus. We also applied a style classifier, trained on words and parts of speech using two machine learning approaches, SVM and CRF, to identify spoken-style utterances in LOTUS-SOC. Training a language model on only the utterances classified as “spoken” further reduced the perplexity of LOTUS-SOC, as evaluated on three different sets of spoken utterances.
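The perplexity comparison described in the abstract can be illustrated with a small sketch. Perplexity of a language model on a reference set of N tokens is exp(-(1/N) Σ log p(w)), so a lower value means the training text is closer in style to the reference. Everything in the block below is an assumption made for illustration: the abstract does not state the language model type, so an add-one-smoothed unigram model stands in for it; the keyword rule `is_spoken` is a hypothetical stand-in for the SVM and CRF style classifiers; and the toy English sentences are not LOTUS-SOC data.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Count word frequencies over whitespace-tokenized sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.split())
    return counts


def perplexity(counts, vocab_size, test_sentences):
    """Perplexity of an add-one-smoothed unigram model on test_sentences."""
    total = sum(counts.values())
    log_prob = 0.0
    n_tokens = 0
    for sentence in test_sentences:
        for word in sentence.split():
            # Add-one smoothing gives unseen words a non-zero probability.
            p = (counts[word] + 1) / (total + vocab_size)
            log_prob += math.log(p)
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)


# Hypothetical stand-in for the style classifier: a keyword rule,
# NOT the SVM/CRF models trained on words and parts of speech.
SPOKEN_CUES = {"yeah", "um", "okay"}


def is_spoken(sentence):
    return any(word in SPOKEN_CUES for word in sentence.split())


# Toy data (assumed for illustration only).
corpus = [
    "yeah i know right",
    "um okay see you",
    "the ministry announced the policy",
    "the committee approved the budget",
]
reference_spoken = ["yeah okay i know"]

# Shared vocabulary so both models are scored over the same event space.
vocab = {w for s in corpus + reference_spoken for w in s.split()}

# Model trained on the whole corpus vs. only the "spoken" subset.
spoken_only = [s for s in corpus if is_spoken(s)]
ppl_all = perplexity(train_unigram(corpus), len(vocab), reference_spoken)
ppl_spoken = perplexity(train_unigram(spoken_only), len(vocab), reference_spoken)

print(f"all utterances:    {ppl_all:.2f}")
print(f"spoken-style only: {ppl_spoken:.2f}")
```

On this toy data the model trained on the filtered subset scores a lower perplexity on the spoken reference, which mirrors the direction of the effect reported in the abstract but says nothing about its size on real data.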