MARS: the First Romanian Pollen Dataset using a Rapid-E Particle Analyzer
M. Boldeanu, C. Marin, D. Ene, L. Mărmureanu, H. Cucu, C. Burileanu
Pub Date: 2021-10-13 | DOI: 10.1109/sped53181.2021.9587447
Pollen allergies are a growing concern for human health, which is why automated pollen monitoring is becoming an important area of research. Machine learning approaches show great promise for tackling this issue, but these algorithms need large training data sets to perform well. This study introduces a new pollen data set, obtained using a Rapid-E particle analyzer, that is representative of the flora of Romania. Pollen from thirteen species present in Romania was used to develop this database, with over 100 thousand samples measured. Using a convolutional neural network, our study shows performance similar to or above that of humans on the task of pollen classification on the newly introduced data set.
{"title":"MARS: the First Romanian Pollen Dataset using a Rapid-E Particle Analyzer","authors":"M. Boldeanu, C. Marin, D. Ene, L. Mărmureanu, H. Cucu, C. Burileanu","doi":"10.1109/sped53181.2021.9587447","DOIUrl":"https://doi.org/10.1109/sped53181.2021.9587447","url":null,"abstract":"Pollen allergies are a growing concern for human health. This is why automated pollen monitoring is becoming an important area of research. Machine learning approaches show great promise for tackling this issue but these algorithms need large training data sets to perform well. This study introduces a new pollen data set, obtained using a Rapid-E particle analyzer, that is representative for the flora of Romania. Pollen, from thirteen species present in Romania, was used in developing this database with over 100 thousand samples measured. Our study shows performance similar to or above that of humans in the task of pollen classification on the newly introduced data set using a convolutional neural network.","PeriodicalId":193702,"journal":{"name":"2021 International Conference on Speech Technology and Human-Computer Dialogue (SpeD)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125938865","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Establishing a Baseline of Romanian Speech-to-Text Models
D. Ungureanu, Madalina Badeanu, Gabriela-Catalina Marica, M. Dascalu, D. Tufis
Pub Date: 2021-10-13 | DOI: 10.1109/sped53181.2021.9587345
With the increasing use of Natural Language Processing to facilitate interactions between humans and machines, automatic speech recognition systems have become increasingly popular as a result of their utility in a wide range of applications. In this paper, we explore well-known open-source speech-to-text engines, namely CMUSphinx, DeepSpeech, and Kaldi, to build a baseline of models that transcribe Romanian speech. These engines employ various underlying methods, from hidden Markov models to deep neural networks that also integrate language models, thus providing a solid baseline for comparison. Because Romanian is still a low-resource language, six datasets of varying quality were merged to obtain 104 hours of speech. To further increase the size of the gathered corpora, our experiments consider data augmentation techniques, specifically SpecAugment, applied to the most promising model. Besides using existing corpora, we publicly release a dataset of 11.5 hours generated from governmental transcripts. The best-performing model, obtained with the Kaldi architecture, uses a hybrid structure with a deep neural network and achieves a WER of 3.10% on the test partition.
{"title":"Establishing a Baseline of Romanian Speech-to-Text Models","authors":"D. Ungureanu, Madalina Badeanu, Gabriela-Catalina Marica, M. Dascalu, D. Tufis","doi":"10.1109/sped53181.2021.9587345","DOIUrl":"https://doi.org/10.1109/sped53181.2021.9587345","url":null,"abstract":"With the increasing usage of Natural Language Processing to facilitate the interactions between humans and machines, automatic speech recognition systems have become increasingly popular as a result of their utility in a wide range of applications. In this paper we explore well-known open-source speech-to-text engines, namely CMUSphinx, DeepSpeech, and Kaldi, to build a baseline of models to transcribe Romanian speech. These engines employ various underlying methods from hidden Markov models to deep neural networks that also integrate language models, thus providing a solid baseline for comparison. Unfortunately, Romanian is still a low-resource language and six datasets of various qualities were merged to obtain 104 hours of speech. To further increase the size of the gathered corpora, our experiments consider data augmentation techniques, specifically SpecAugment, applied on the most promising model. Besides using existing corpora, we publicly release a dataset of 11.5 hours generated from Governmental transcripts. The best performing model is obtained using the Kaldi architecture, considers a hybrid structure with a Deep Neural Network, and achieves a WER of 3.10% on the test partition.","PeriodicalId":193702,"journal":{"name":"2021 International Conference on Speech Technology and Human-Computer Dialogue (SpeD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129761117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Influence of Silence and Noise Filtering on Speech Quality Monitoring
R. Jaiswal
Pub Date: 2021-10-13 | DOI: 10.1109/sped53181.2021.9587364
With the exponential increase in mobile users and internet subscribers, the use of voice over internet protocol (VoIP) applications is increasing dramatically. People use different VoIP applications for communication, for example Google Meet, Microsoft Skype, and Zoom. Single-ended speech quality metrics are employed for measuring and monitoring the quality of speech. However, different types of degradations present in the surroundings distort speech quality. To meet the desired quality of experience (QoE) level of end-users of VoIP applications, it is necessary to reduce VoIP degradations and obtain optimized speech quality. Along that line, this paper investigates silence and noise filtering as a pre-processing block used in conjunction with a single-ended speech quality metric, under various commonly occurring degradations encountered in VoIP communication. This can help internet service providers understand the potential root causes of decreased speech quality and then apply QoE management services to meet the desired QoE level. Results demonstrate that applying this joint pre-processing to speech samples under various VoIP degradations improves speech quality to a great extent.
Romanian printed language, statistical independence and the type II statistical error
Alexandru Dinu, A. Vlad
Pub Date: 2021-10-13 | DOI: 10.1109/sped53181.2021.9587397
The paper revisits the notion of statistical independence for printed Romanian, focusing on the case in which the language is considered a chain of words. The analysis is carried out on a literary corpus of approximately 6 million words. We aim to improve the understanding of statistical independence in natural texts and to use this concept to evaluate the numerical properties of the printed language. One main objective is to correlate the statistical independence results with the type II statistical error and, at the same time, to extend and revisit the authors' previous results. We investigated different scenarios relating to the word clusters involved in the process of creating Artificial Words (corroborated with the statistical independence evaluation). The Artificial Words consist of groups of low-probability words (based on previous findings on the type II statistical error in word probability investigation), and the results support d = 100 as the minimum statistical independence distance.
{"title":"Romanian printed language, statistical independence and the type II statistical error","authors":"Alexandru Dinu, A. Vlad","doi":"10.1109/sped53181.2021.9587397","DOIUrl":"https://doi.org/10.1109/sped53181.2021.9587397","url":null,"abstract":"The paper revisits the notion of statistical independence for printed Romanian with a focus for the case when the language is considered as a chain of words. The analysis is carried out on a literary corpus of approx. 6 million words. We aim to improve the perception of the concept of statistical independence for natural texts and to use this concept to evaluate the numerical properties of the printed language.One main objective is to correlate the statistical independence results with the type II statistical error and at the same time to expand and circle back on previous results of the authors. We investigated different scenarios in relation with different word clusters involved in the process of creating Artificial Words (corroborated with the statistical independence evaluation). The Artificial Words consist of groups of the low probability words (based on previous findings on the type II statistical error in word probability investigation) and the results support the d = 100 minimum statistical independence distance.","PeriodicalId":193702,"journal":{"name":"2021 International Conference on Speech Technology and Human-Computer Dialogue (SpeD)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114809333","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Comparison in Suprasegmental Characteristics between Typical and Dysarthric Talkers at Varying Severity Levels
M. Soleymanpour, Michael T. Johnson, J. Berry
Pub Date: 2021-10-13 | DOI: 10.1109/sped53181.2021.9587369
Dysarthria is a speech disorder often characterized by slow speech with reduced intelligibility. This preliminary study compares the suprasegmental characteristics of typical and dysarthric speakers at varying severity levels, with the long-term goal of improving methods for dysarthric speech synthesis/augmentation and enhancement. First, we analyze the phoneme, speaking-rate, and pause characteristics of typical and dysarthric speech using phoneme- and word-level alignment information extracted by the Montreal Forced Aligner (MFA). Then, pitch and intensity declination trends and ranges are analyzed; the declinations are measured by fitting a regression line. These analyses are conducted on dysarthric speech in TORGO, which contains 8 dysarthric speakers with cerebral palsy or amyotrophic lateral sclerosis and 7 age- and gender-matched typical speakers. These results are important for the development of dysarthric speech synthesis and augmentation, to statistically model and evaluate characteristics such as pause, speaking rate, pitch, and intensity.
{"title":"Comparison in Suprasegmental Characteristics between Typical and Dysarthric Talkers at Varying Severity Levels","authors":"M. Soleymanpour, Michael T. Johnson, J. Berry","doi":"10.1109/sped53181.2021.9587369","DOIUrl":"https://doi.org/10.1109/sped53181.2021.9587369","url":null,"abstract":"Dysarthria is a speech disorder often characterized by slow speech with reduced intelligibility. This preliminary study investigates suprasegmental characteristics between typical and dysarthric speakers at varying severity levels, with the long-term goal of improving methods for dysarthric speech synthesis/augmentation and enhancement. First, we aim to analyze phonemes, speaking rate and pause characteristics of typical and dysarthric speech using the phoneme- and word-level alignment information extracted by Montreal Forced Aligner (MFA). Then, pitch and intensity declination trends and range analysis are conducted. The pitch and intensity declination are measured by fitting a regression line. These analyses are conducted on dysarthric speech in TORGO, containing 8 dysarthric speakers involved with cerebral palsy or amyotrophic lateral sclerosis and 7 age- and gender-matched typical speakers. These results are important for the development of dysarthric speech synthesis, augmentation to statistically model and evaluate characteristics such as pause, speaking rate, pitch, and intensity.","PeriodicalId":193702,"journal":{"name":"2021 International Conference on Speech Technology and Human-Computer Dialogue (SpeD)","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114923141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mispronunciation Detection and Diagnosis for Mandarin Accented English Speech
Subash Khanal, Michael T. Johnson, M. Soleymanpour, Narjes Bozorg
Pub Date: 2021-10-13 | DOI: 10.1109/sped53181.2021.9587408
This paper presents a Mispronunciation Detection and Diagnosis (MDD) system based on a range of Automatic Speech Recognition (ASR) models and feature types. The goals of this research are to assess the ability of speech recognition systems to detect and diagnose the common pronunciation errors of non-native (L2) speakers of English, and to assess the contribution of the information offered by Electromagnetic Articulography (EMA) data to improving the performance of such MDD systems. To evaluate the ability of the ASR systems to detect and diagnose pronunciation errors, the sequences of phonemes recognized by the ASR models were aligned with human-labeled phonetic transcripts as well as with the original phonetic prompts. This three-way alignment determined the MDD-related metrics of the ASR system. System architectures included GMM-HMM, DNN, and RNN based ASR engines. Articulatory features derived from the Electromagnetic Articulography corpus of Mandarin-Accented English (EMA-MAE) were used along with acoustic features to compare the performance of the MDD systems. The best-performing system, using a combination of acoustic and articulatory features, had an accuracy of 82.4%, a diagnostic accuracy of 75.8%, and a false rejection rate of 17.2%.
{"title":"Mispronunciation Detection and Diagnosis for Mandarin Accented English Speech","authors":"Subash Khanal, Michael T. Johnson, M. Soleymanpour, Narjes Bozorg","doi":"10.1109/sped53181.2021.9587408","DOIUrl":"https://doi.org/10.1109/sped53181.2021.9587408","url":null,"abstract":"This paper presents a Mispronunciation Detection and Diagnosis (MDD) system based on a range of Automatic Speech Recognition (ASR) models and feature types. The goals of this research are to assess the ability of speech recognition systems to detect and diagnose the common pronunciation errors seen in non-native speakers (L2) of English and to assess the contribution of the information offered by Electromagnetic Articulography (EMA) data in improving the performance of such MDD systems. To evaluate the ability of the ASR systems to detect and diagnose pronunciation errors, the recognized sequence of phonemes generated by the ASR models were aligned with human-labeled phonetic transcripts as well as with the original phonetic prompts. This three-way alignment determined the MDD related metrics of the ASR system. System architectures included GMM-HMM, DNN, and RNN based ASR engines for the MDD system. Articulatory features derived from the Electromagnetic Articulography corpus of Mandarin-Accented English (EMA-MAE) were utilized along with acoustic features to compare the performance of MDD systems. The best performing system using a combination of acoustic and articulatory features had an accuracy of 82.4%, diagnostic accuracy of 75.8% and a false rejection rate of 17.2%.","PeriodicalId":193702,"journal":{"name":"2021 International Conference on Speech Technology and Human-Computer Dialogue (SpeD)","volume":"1997 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123632494","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Human-Machine Interaction Speech Corpus from the ROBIN project
V. Pais, Radu Ion, Andrei-Marius Avram, Elena Irimia, V. Mititelu, Maria Mitrofan
Pub Date: 2021-10-13 | DOI: 10.1109/SpeD53181.2021.9587355
This paper introduces a new Romanian speech corpus from the ROBIN project, called the ROBIN Technical Acquisition Speech Corpus (ROBINTASC). Its main purpose is to improve the behaviour of a conversational agent, allowing human-machine interaction in the context of purchasing technical equipment. The paper contains a detailed description of the acquisition process and corpus statistics, as well as an evaluation of the corpus's influence on a low-latency ASR system and on a dialogue component.
{"title":"Human-Machine Interaction Speech Corpus from the ROBIN project","authors":"V. Pais, Radu Ion, Andrei-Marius Avram, Elena Irimia, V. Mititelu, Maria Mitrofan","doi":"10.1109/SpeD53181.2021.9587355","DOIUrl":"https://doi.org/10.1109/SpeD53181.2021.9587355","url":null,"abstract":"This paper introduces a new Romanian speech corpus from the ROBIN project, called ROBIN Technical Acquisition Speech Corpus (ROBINTASC). Its main purpose was to improve the behaviour of a conversational agent, allowing human-machine interaction in the context of purchasing technical equipment. The paper contains a detailed description of the acquisition process, corpus statistics as well as an evaluation of the corpus influence on a low-latency ASR system as well as a dialogue component.","PeriodicalId":193702,"journal":{"name":"2021 International Conference on Speech Technology and Human-Computer Dialogue (SpeD)","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126533934","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An analysis of the data efficiency in Tacotron2 speech synthesis system
G. Săracu, Adriana Stan
Pub Date: 2021-10-13 | DOI: 10.1109/sped53181.2021.9587411
This paper presents an evaluation of the amount of data required by the Tacotron2 speech synthesis model to achieve good-quality synthetic output. We evaluate the model's ability to adapt to new speakers in very limited data scenarios. We use three Romanian speakers, for each of whom we gathered at most 5 minutes of speech, and use this data to fine-tune a large pre-trained model over a few training epochs. We assess the performance of the system by evaluating intelligibility, naturalness, and speaker similarity measures, and by analyzing the trade-off between speech quality and overfitting of the network. The results show that the Tacotron2 network can replicate the identity of a speaker from as little as one speech sample. It also inherently learns individual grapheme representations, so that if the training data is carefully selected to cover all the common graphemes of the language, the adaptation data requirements can be lowered significantly.
{"title":"An analysis of the data efficiency in Tacotron2 speech synthesis system","authors":"G. Săracu, Adriana Stan","doi":"10.1109/sped53181.2021.9587411","DOIUrl":"https://doi.org/10.1109/sped53181.2021.9587411","url":null,"abstract":"This paper introduces an evaluation of the amount of data required by the Tacotron2 speech synthesis model in order to achieve a good quality output synthesis. We evaluate the capabilities of the model to adapt to new speakers in very limited data scenarios. We use three Romanian speakers for which we gathered at most 5 minutes of speech, and use this data to fine tune a large pre-trained model over a few training epochs. We look at the performance of the system by evaluating the intelligibility, naturalness and speaker similarity measures, as well as performing an analysis of the trade-off between speech quality and overfitting of the network.The results show that the Tacotron2 network can replicate the identity of a speaker from as little as one speech sample. Also it inherently learns individual grapheme representations, such that if the training data is carefully selected to present all the common graphemes in the language, the adaptation data requirements can be significantly lowered.","PeriodicalId":193702,"journal":{"name":"2021 International Conference on Speech Technology and Human-Computer Dialogue (SpeD)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128125124","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dunstan Baby Language Classification with CNN
Costin Andrei Bratan, Mirela Gheorghe, Ioan Ispas, E. Franti, M. Dascalu, S. Stoicescu, Ioana Rosca, Florentina Gherghiceanu, Doina Dumitrache, L. Nastase
Pub Date: 2021-10-13 | DOI: 10.1109/sped53181.2021.9587374
Several methods have been reported in the scientific literature for classifying infant cries, in order to automatically detect the need behind the tears and help parents and caretakers. With the same aim, this paper takes an original approach in which the sounds that precede the cry are used. Such sounds can be considered primitive words and are classified according to the "Dunstan Baby Language". The paper verifies the universal baby language hypothesis, starting from the research reported in a previous article. A CNN architecture trained on recordings of babies from Australia was used to classify audio material from Romanian babies, in an attempt to see what happens when the participants belong to a different cultural landscape. The database of sounds made by Romanian babies was labelled separately by doctors in the maternity hospitals and by two Dunstan experts. Finally, the results of the CNN automatic classification were compared to those obtained by the Dunstan coaches. The conclusions proved that the Dunstan language is universal.
{"title":"Dunstan Baby Language Classification with CNN","authors":"Costin Andrei Bratan, Mirela Gheorghe, Ioan Ispas, E. Franti, M. Dascalu, S. Stoicescu, Ioana Rosca, Florentina Gherghiceanu, Doina Dumitrache, L. Nastase","doi":"10.1109/sped53181.2021.9587374","DOIUrl":"https://doi.org/10.1109/sped53181.2021.9587374","url":null,"abstract":"Several methods were reported in the scientific literature for the classification of the infant cries, in order to automatically detect the need behind their tears and help the parents and caretakers. In the same scope, this paper has an original approach in which the sounds that precede the cry are used. Such sounds can be considered primitive words and are classified according to the “Dunstan Baby Language”. The paper verifies the universal baby language hypothesis starting from the research reported in a previous article. A CNN architecture trained with recordings of babies from Australia was used for classifying the audio material coming from Romanian babies. It was an attempt to see what happens should the participants belong to a different cultural landscape. The database loaded with the sounds made by Romanian babies was labelled by doctors in the maternity hospitals and two Dunstan experts, separately. Finally, the results of the CNN automatic classification were compared to those obtained by the Dunstan coaches. The conclusions have proved that Dunstan language is universal.","PeriodicalId":193702,"journal":{"name":"2021 International Conference on Speech Technology and Human-Computer Dialogue (SpeD)","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133194740","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}