Diadochokinetic (DDK) speech tasks involve the repetitive production of consonant-vowel syllables. These tasks are useful for detecting impairments, supporting differential diagnosis, and monitoring progress in speech-motor disorders. However, manual analysis of these tasks is time-consuming, subjective, and provides only a rough picture of speech. This paper presents several deep neural network models that operate on the raw waveform to automatically segment stop consonants and vowels from unannotated and untranscribed speech. A deep encoder serves as a feature-extraction module, replacing conventional signal-processing features; diverse architectures, such as convolutional neural networks (CNNs) and large self-supervised models like HuBERT, are applied for this extraction step. A decoder model then uses the derived embeddings to identify the type of each frame, and the paper studies diverse decoder architectures, ranging from linear layers and LSTMs to CNNs and transformers. These architectures are assessed for their ability to detect speech rate, sound duration, and boundary locations on a dataset of healthy individuals and an unseen dataset of older individuals with Parkinson’s Disease. The results reveal that an LSTM model outperforms all other models on both datasets and is comparable to trained human annotators.
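As a rough illustration of this encoder-decoder setup, the sketch below pairs a frozen HuBERT encoder with a BiLSTM decoder that emits per-frame class logits; the three-way label set (silence, stop consonant, vowel) and the layer sizes are illustrative assumptions, not the paper's reported configuration.

```python
# Minimal sketch (not the paper's exact model): a frozen HuBERT encoder
# feeding a BiLSTM decoder that labels each ~20 ms frame. The label set
# {silence, stop consonant, vowel} and hidden sizes are assumptions.
import torch
import torch.nn as nn
from transformers import HubertModel

class FrameSegmenter(nn.Module):
    def __init__(self, n_classes=3, hidden=256):
        super().__init__()
        self.encoder = HubertModel.from_pretrained("facebook/hubert-base-ls960")
        self.encoder.requires_grad_(False)      # use HuBERT as a fixed feature extractor
        self.decoder = nn.LSTM(input_size=768, hidden_size=hidden,
                               batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, waveform):                # waveform: (batch, samples) at 16 kHz
        with torch.no_grad():
            feats = self.encoder(waveform).last_hidden_state  # (batch, frames, 768)
        out, _ = self.decoder(feats)
        return self.head(out)                   # per-frame class logits

logits = FrameSegmenter()(torch.randn(1, 16000))  # 1 s of audio -> ~49 frames of logits
```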
Integrating allusions into poems is an advanced form of human poetry writing that allows an author to express their thoughts clearly and to resonate with readers. However, existing poetry generation work mainly focuses on improving the coherence and fluency of poetry, while generating poems with allusion knowledge is rarely considered. To address this issue, we propose an Allusion-aware Chinese Poetry Generation (ACPG) framework in this study. Concretely, we first release an Allusion-Enriched Poetry (AEP) dataset built by linking poems with their historical allusions, which may open a new research direction for poetry generation. Based on this dataset, we design a three-stage learning mechanism that supports training under a low-resource setting and effectively exploits large-scale poetry and allusion data to generate informative allusive poems. Extensive experiments demonstrate the effectiveness of ACPG against a series of proposed baselines. Moreover, the ACPG framework can also be applied to lyric generation and other controlled text generation tasks, incorporating allusion knowledge into the generated results to enhance the meaning and quality of the texts.
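The abstract does not detail the three stages, but a generic staged fine-tuning recipe of this kind might look like the toy sketch below, where a small language model is trained successively on progressively more task-specific data; the stage ordering, toy model, and hyperparameters are all illustrative assumptions, not the ACPG mechanism itself.

```python
# Highly schematic three-stage, low-resource training sketch (the
# paper's actual ACPG mechanism is not reproduced; stages, model,
# and learning rates are assumptions). A toy LSTM LM stands in
# for the poem generator.
import torch
import torch.nn as nn

class ToyPoemLM(nn.Module):
    def __init__(self, vocab=5000, dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, ids):                     # ids: (batch, seq_len) token ids
        h, _ = self.rnn(self.emb(ids))
        return self.out(h)                      # next-token logits

def train_stage(model, batches, lr):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for ids in batches:                         # standard next-token LM objective
        logits = model(ids[:, :-1])
        loss = loss_fn(logits.reshape(-1, logits.size(-1)), ids[:, 1:].reshape(-1))
        opt.zero_grad(); loss.backward(); opt.step()

model = ToyPoemLM()
fake = lambda n: [torch.randint(0, 5000, (8, 32)) for _ in range(n)]  # placeholder data
train_stage(model, fake(100), 1e-3)  # stage 1: large-scale general poetry corpus
train_stage(model, fake(50), 5e-4)   # stage 2: allusion texts, to absorb allusion knowledge
train_stage(model, fake(10), 1e-4)   # stage 3: small allusion-enriched (AEP) poem pairs
```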
A novel feature based on the chirp z-transform, offering an improved representation of the underlying true spectrum, is proposed. This feature, the chirp MFCC, is derived by computing the Mel frequency cepstral coefficients from the chirp magnitude spectrum instead of the Fourier transform magnitude spectrum. The theoretical foundations of the proposal are presented, and experimental validation using a product of likelihood Gaussians shows the improved class separation offered by the proposed chirp MFCC compared with the basic MFCC. Further, a real-world evaluation of the feature is performed on three diverse tasks, namely, speech–music classification, speaker identification, and speech commands recognition. In all three tasks, the proposed chirp MFCC offers considerable improvements.
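A minimal sketch of how such a feature could be computed with off-the-shelf tools is shown below: each frame's spectrum is evaluated on a circle of radius r < 1 via scipy.signal.czt, and the standard mel filterbank, log, and DCT steps follow. Evaluating slightly inside the unit circle moves the analysis contour closer to the vocal-tract poles, which sharpens spectral peaks relative to the DFT's unit-circle evaluation. The radius and frame settings here are illustrative assumptions, not the paper's tuned values.

```python
# Minimal chirp-MFCC sketch (radius r and frame settings are assumptions).
# Requires scipy >= 1.8 for scipy.signal.czt.
import numpy as np
import librosa
from scipy.signal import czt
from scipy.fft import dct

def chirp_mfcc(y, sr=16000, n_fft=512, n_mels=40, n_mfcc=13, r=0.95):
    frames = librosa.util.frame(y, frame_length=n_fft, hop_length=n_fft // 2)
    frames = frames * np.hamming(n_fft)[:, None]        # (n_fft, n_frames), windowed
    m = n_fft // 2 + 1                                  # bins covering 0..sr/2, like rfft
    # czt evaluates z_k = a * w^{-k}: here, points on a radius-r circle
    # sweeping angles 0..pi, instead of the DFT's unit-circle samples.
    spec = czt(frames, m=m, w=np.exp(-1j * np.pi / (m - 1)), a=r + 0j, axis=0)
    mag = np.abs(spec)                                  # chirp magnitude spectrum
    mel = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels) @ mag
    return dct(np.log(mel + 1e-10), type=2, axis=0, norm="ortho")[:n_mfcc]

coeffs = chirp_mfcc(np.random.randn(16000))             # shape: (13, n_frames)
```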
Existing approaches for disfluency detection typically require large annotated datasets. However, current datasets for this task are limited, suffer from class imbalance, and lack some types of disfluencies that are encountered in real-world scenarios. At the same time, augmentation techniques for disfluency detection cannot model complex types of disfluencies. This limits such approaches to pre-training only, since the generated data are not indicative of disfluencies that occur in real scenarios and, as we experimentally demonstrate, cannot be used directly to train disfluency detection models. This imposes significant constraints on the practical usefulness of such approaches, since real disfluencies still have to be collected in order to train the models. In this work, we propose Large-scale ARtificial Disfluency Generation (LARD), a method for automatically generating artificial disfluencies, and more specifically repairs, from fluent text. Unlike existing augmentation techniques, LARD can simulate all the different and complex types of disfluencies. In addition, it incorporates contextual embeddings into the disfluency generation process to produce realistic, context-aware artificial disfluencies. LARD can be used effectively to train disfluency detection models, bypassing the requirement for annotated disfluent data. Our empirical evaluation shows that LARD outperforms existing rule-based augmentation methods and increases the accuracy of existing disfluency detectors. In addition, experiments demonstrate that the proposed method can be used effectively in a low-resource setup.
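The released LARD implementation is not reproduced here, but the core repair-generation idea can be sketched with a masked language model proposing a context-plausible alternative word, which is spliced in as a reparandum before the original continuation; the model choice, interregnum token, and one-word retrace are illustrative assumptions.

```python
# Sketch of context-aware repair generation from fluent text (not the
# released LARD code): a masked LM supplies a plausible "wrong" word,
# inserted as a reparandum before the original continuation.
import random
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

def insert_repair(sentence, interregnum="uh"):
    words = sentence.split()
    i = random.randrange(1, len(words))          # word the speaker will "correct"
    masked = " ".join(words[:i] + [fill.tokenizer.mask_token] + words[i + 1:])
    candidates = [c["token_str"] for c in fill(masked)]
    alt = next((c for c in candidates if c.lower() != words[i].lower()), None)
    if alt is None:
        return sentence                          # no usable alternative found
    # reparandum (wrong word) + interregnum + repair (retrace one word,
    # then the original continuation)
    return " ".join(words[:i] + [alt, interregnum] + words[i - 1:])

print(insert_repair("i would like to book a flight to boston"))
# e.g. "i would like to book a flight to chicago uh to boston"
```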
The existence of correlation among mental disorders is a well-known phenomenon. Multi-task learning (MTL) has been reported to enhance the detection of a targeted mental disorder by leveraging its correlation with other related mental disorders, mainly in the textual and visual modalities; whether the same holds for the audio modality has yet to be explored. In this study, we explore homogeneous and heterogeneous MTL paradigms for detecting two correlated mental disorders, namely major depressive disorder (MDD) and post-traumatic stress disorder (PTSD), on a publicly available audio dataset. The detection of each disorder is employed as an auxiliary task when the other is the main task, and a few other tasks are employed as auxiliary tasks as well. The results show that both MTL paradigms, implemented with two deep-learning models, outperformed the corresponding single-task learning (STL). The best relative improvements in the detection performance of MDD and PTSD are 29.9% and 28.8%, respectively. Furthermore, we analyzed the cross-corpus generalization of MTL using two distinct datasets that contain MDD/PTSD instances. The results indicate that the generalizability of MTL is significantly superior to that of STL, with best relative improvements in cross-corpus generalization of 25.0% for MDD detection and 56.5% for PTSD detection.
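As a schematic of such an MTL setup, the sketch below shares a single audio encoder between an MDD head and a PTSD head and combines the two losses with a weighted sum; the GRU encoder, feature dimensions, and loss weights are illustrative assumptions rather than the paper's two models.

```python
# Hard-parameter-sharing MTL sketch (encoder, sizes, and loss weights
# are assumptions): a shared trunk feeds separate MDD and PTSD heads.
import torch
import torch.nn as nn

class SharedMTL(nn.Module):
    def __init__(self, feat_dim=80, hidden=128):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)  # shared trunk
        self.mdd_head = nn.Linear(hidden, 1)     # main task: MDD detection
        self.ptsd_head = nn.Linear(hidden, 1)    # auxiliary task: PTSD detection

    def forward(self, x):                        # x: (batch, frames, feat_dim) features
        _, h = self.encoder(x)                   # final hidden state summarizes the clip
        h = h.squeeze(0)
        return self.mdd_head(h), self.ptsd_head(h)

model = SharedMTL()
mdd_logit, ptsd_logit = model(torch.randn(4, 300, 80))
# joint objective: weighted sum of per-task losses (the 0.5 weight is a design choice)
loss = nn.functional.binary_cross_entropy_with_logits(mdd_logit, torch.ones(4, 1)) \
     + 0.5 * nn.functional.binary_cross_entropy_with_logits(ptsd_logit, torch.zeros(4, 1))
```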