Richard X Touret, Nicholas Durofchalk, Karim G Sabra
This Letter investigates the influence of source motion on the performance of the ray-based blind deconvolution algorithm (RBD). RBD is used to estimate channel impulse responses and source signals from opportunistic sources such as shipping vessels but was derived under a stationary source assumption. A theoretical correction for Doppler from a simplified moving source model is used to quantify the biases in estimated arrival angles and travel times from RBD. This correction is numerically validated using environmental data from the SBCeX16 experiment in the Santa Barbara Channel. Implications for source localization and potential passive acoustic tomography using RBD are discussed.
{"title":"Quantifying the influence of source motion on the ray-based blind deconvolution algorithm.","authors":"Richard X Touret, Nicholas Durofchalk, Karim G Sabra","doi":"10.1121/10.0030344","DOIUrl":"10.1121/10.0030344","url":null,"abstract":"<p><p>This Letter investigates the influence of source motion on the performance of the ray-based blind deconvolution algorithm (RBD). RBD is used to estimate channel impulse responses and source signals from opportunistic sources such as shipping vessels but was derived under a stationary source assumption. A theoretical correction for Doppler from a simplified moving source model is used to quantify the biases in estimated arrival angles and travel times from RBD. This correction is numerically validated using environmental data from the SBCeX16 experiment in the Santa Barbara Channel. Implications for source localization and potential passive acoustic tomography using RBD are discussed.</p>","PeriodicalId":73538,"journal":{"name":"JASA express letters","volume":"4 11","pages":""},"PeriodicalIF":1.2,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142669859","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Natarajan Balaji Shankar, Amber Afshan, Alexander Johnson, Aurosweta Mahapatra, Alejandra Martin, Haolun Ni, Hae Won Park, Marlen Quintero Perez, Gary Yeung, Alison Bailey, Cynthia Breazeal, Abeer Alwan
This paper describes an original dataset of children's speech, collected through the use of JIBO, a social robot. The dataset encompasses recordings from 110 children, aged 4-7 years old, who participated in a letter and digit identification task and extended oral discourse tasks requiring explanation skills, totaling 21 h of session data. Spanning a 2-year collection period, this dataset contains a longitudinal component with a subset of participants returning for repeat recordings. The dataset, with session recordings and transcriptions, is publicly available, providing researchers with a valuable resource to advance investigations into child language development.
{"title":"The JIBO Kids Corpus: A speech dataset of child-robot interactions in a classroom environment.","authors":"Natarajan Balaji Shankar, Amber Afshan, Alexander Johnson, Aurosweta Mahapatra, Alejandra Martin, Haolun Ni, Hae Won Park, Marlen Quintero Perez, Gary Yeung, Alison Bailey, Cynthia Breazeal, Abeer Alwan","doi":"10.1121/10.0034195","DOIUrl":"https://doi.org/10.1121/10.0034195","url":null,"abstract":"<p><p>This paper describes an original dataset of children's speech, collected through the use of JIBO, a social robot. The dataset encompasses recordings from 110 children, aged 4-7 years old, who participated in a letter and digit identification task and extended oral discourse tasks requiring explanation skills, totaling 21 h of session data. Spanning a 2-year collection period, this dataset contains a longitudinal component with a subset of participants returning for repeat recordings. The dataset, with session recordings and transcriptions, is publicly available, providing researchers with a valuable resource to advance investigations into child language development.</p>","PeriodicalId":73538,"journal":{"name":"JASA express letters","volume":"4 11","pages":""},"PeriodicalIF":1.2,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142559643","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xuan Shi, Tiantian Feng, Kevin Huang, Sudarsana Reddy Kadiri, Jihwan Lee, Yijing Lu, Yubin Zhang, Louis Goldstein, Shrikanth Narayanan
Variability in speech pronunciation is widely observed across different linguistic backgrounds, which impacts modern automatic speech recognition performance. Here, we evaluate the performance of a self-supervised speech model in phoneme recognition using direct articulatory evidence. Findings indicate significant differences in phoneme recognition, especially in front vowels, between American English and Indian English speakers. To gain a deeper understanding of these differences, we conduct real-time MRI-based articulatory analysis, revealing distinct velar region patterns during the production of specific front vowels. This underscores the need to deepen the scientific understanding of self-supervised speech model variances to advance robust and inclusive speech technology.
{"title":"Direct articulatory observation reveals phoneme recognition performance characteristics of a self-supervised speech model.","authors":"Xuan Shi, Tiantian Feng, Kevin Huang, Sudarsana Reddy Kadiri, Jihwan Lee, Yijing Lu, Yubin Zhang, Louis Goldstein, Shrikanth Narayanan","doi":"10.1121/10.0034430","DOIUrl":"https://doi.org/10.1121/10.0034430","url":null,"abstract":"<p><p>Variability in speech pronunciation is widely observed across different linguistic backgrounds, which impacts modern automatic speech recognition performance. Here, we evaluate the performance of a self-supervised speech model in phoneme recognition using direct articulatory evidence. Findings indicate significant differences in phoneme recognition, especially in front vowels, between American English and Indian English speakers. To gain a deeper understanding of these differences, we conduct real-time MRI-based articulatory analysis, revealing distinct velar region patterns during the production of specific front vowels. This underscores the need to deepen the scientific understanding of self-supervised speech model variances to advance robust and inclusive speech technology.</p>","PeriodicalId":73538,"journal":{"name":"JASA express letters","volume":"4 11","pages":""},"PeriodicalIF":1.2,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142649750","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sanjana M C, Latha G, Thirunavukkarasu A
Ambient noise measurements from an Arctic fjord during summer and winter are analyzed using spectral, coherence, and directionality estimates from a vertically separated pair of hydrophones. The primary noise sources, attributed to wind, shipping, and ice activity, are categorized, and the coherence between the two hydrophones is estimated. Estimates of the vertical directionality of the noise field, and of its variation over time and between seasons, strengthen the analysis of the time-varying nature of the noise sources. Source identification using such processing techniques serves as a valuable tool in passive acoustic monitoring systems for studying ice dynamics in glacierized fjords.
Ambient noise source characterization using spectral, coherence, and directionality estimates at Kongsfjorden. JASA Express Letters 4(11), November 2024. https://doi.org/10.1121/10.0034307
Alessio Lampis, Alexander Mayer, Vasileios Chatziioannou
This study investigates the influence of string properties on bowed string attack playability. To assess the attack playability of different string types, the strings were excited with a range of bow forces and bow accelerations, and the transient response was measured under these bowing control parameters. Playability maps of transient duration as a function of bow force and acceleration (Guettler diagrams) were obtained experimentally with a robotic bowing machine for four different types of cello G2 strings. Results indicate variations in playability across string types, suggesting that string properties impact attack duration.
{"title":"An experimental approach for comparing the influence of cello string type on bowed attack response.","authors":"Alessio Lampis, Alexander Mayer, Vasileios Chatziioannou","doi":"10.1121/10.0034330","DOIUrl":"https://doi.org/10.1121/10.0034330","url":null,"abstract":"<p><p>This study investigates the influence of string properties on bowed string attack playability. To assess the attack playability of different string types, a variety of bow forces and bow accelerations were chosen to excite the strings and measure the transient response under different bowing control parameters. The experimentally obtained playability maps of transient duration as function of bow force and acceleration (Guettler diagram) were obtained with a robotic bowing machine, from four different types of cello G2 strings. Results indicate variations in playability across string types, suggesting that string properties impact attack duration.</p>","PeriodicalId":73538,"journal":{"name":"JASA express letters","volume":"4 11","pages":""},"PeriodicalIF":1.2,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142607745","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chloe Patman, Eleanor Chodroff
In the development of automatic speech recognition systems, achieving human-like performance has been a long-held goal. Recent releases of large spoken language models have claimed to achieve such performance, although direct comparison to humans has been severely limited. The present study tested L1 British English listeners against two automatic speech recognition systems (wav2vec 2.0 and Whisper, base and large sizes) in adverse listening conditions: speech-shaped noise and pub noise, at different signal-to-noise ratios, and recordings produced with or without face masks. Humans maintained the advantage against all systems, except for Whisper large, which outperformed humans in every condition but pub noise.
Speech recognition in adverse conditions by humans and machines. JASA Express Letters 4(11), November 2024. https://doi.org/10.1121/10.0032473
Kent L Gee, Noah L Pulsipher, Makayle S Kellison, Logan T Mathews, Mark C Anderson, Grant W Hart
Far-field (9.7-35.5 km) noise measurements were made during the fifth flight test of SpaceX's Starship Super Heavy, which included the first-ever booster catch. Key results for launch and flyback sonic boom sound levels include: (a) A-weighted sound exposure levels during launch are 18 dB less than predicted at 35 km; (b) the flyback sonic boom exceeds 10 psf at 10 km; and (c) Starship launch noise is substantially louder than that of the Space Launch System and Falcon 9, with the far-field noise produced during a Starship launch at least ten times that of Falcon 9.
{"title":"Starship super heavy acoustics: Far-field noise measurements during launch and the first-ever booster catch.","authors":"Kent L Gee, Noah L Pulsipher, Makayle S Kellison, Logan T Mathews, Mark C Anderson, Grant W Hart","doi":"10.1121/10.0034453","DOIUrl":"10.1121/10.0034453","url":null,"abstract":"<p><p>Far-field (9.7-35.5 km) noise measurements were made during the fifth flight test of SpaceX's Starship Super Heavy, which included the first-ever booster catch. Key results involving launch and flyback sonic boom sound levels include (a) A-weighted sound exposure levels during launch are 18 dB less than predicted at 35 km; (b) the flyback sonic boom exceeds 10 psf at 10 km; and (c) comparing Starship launch noise to Space Launch System and Falcon 9 shows that Starship is substantially louder; the far-field noise produced during a Starship launch is at least ten times that of Falcon 9.</p>","PeriodicalId":73538,"journal":{"name":"JASA express letters","volume":"4 11","pages":""},"PeriodicalIF":1.2,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142640165","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
B Chidambar, D Hanumanth Rao Naidu
Deep neural network (DNN) based speech enhancement techniques have shown superior performance compared to traditional speech enhancement approaches in handling nonstationary noise. However, their performance is often compromised by mismatch between testing and training conditions. In this work, a codebook integrated deep neural network (CI-DNN) approach is introduced for speech enhancement, which mitigates this mismatch by employing existing speaker-adapted codebooks with a DNN. The proposed CI-DNN demonstrates better speech enhancement performance than the corresponding speaker-independent DNNs. The CI-DNN approach essentially involves a post-processing operation for the DNN and, hence, is applicable to any DNN architecture.
Speaker adaptation using codebook integrated deep neural networks for speech enhancement. JASA Express Letters 4(11), November 2024. https://doi.org/10.1121/10.0034308
Pauline Bolin Liu, Mingxing Li
This study investigates the relative perceptual distinctiveness of the [n] vs [l] contrast in different vowel contexts ([_a] vs [_i]) and tonal contexts (high-initial, such as HH and HL, vs low-initial, such as LL and LH). The results of two speeded AX discrimination experiments indicated that the [n-l] contrast is perceptually more distinct in the [_a] context and with a high-initial tone. The results are consistent with the typology of the [n] vs [l] contrast across Chinese dialects, where the contrast is more frequently observed in the [_a] context and with a high-initial tone, supporting a connection between phonological typology and perceptual distinctiveness.
The perceptual distinctiveness of the [n-l] contrast in different vowel and tonal contexts. JASA Express Letters 4(11), November 2024. https://doi.org/10.1121/10.0034196
Melissa J Polonenko, Ross K Maddox
Deriving human neural responses to natural speech is now possible, but the responses to male- and female-uttered speech have been shown to differ. These talker differences may complicate interpretations or restrict experimental designs geared toward more realistic communication scenarios. This study found that when a male talker and a female talker had the same fundamental frequency, auditory brainstem responses (ABRs) were very similar. Those responses became smaller and later with increasing fundamental frequency, as did click ABRs with increasing stimulus rates. Modeled responses suggested that the speech and click ABR differences were reasonably predicted by peripheral and brainstem processing of stimulus acoustics.
Fundamental frequency predominantly drives talker differences in auditory brainstem responses to continuous speech. JASA Express Letters 4(11), November 2024. https://doi.org/10.1121/10.0034329