Adopting machine translation in the healthcare sector: A methodological multi-criteria review
Pub Date : 2023-10-29 DOI: 10.1016/j.csl.2023.101582
Marco Zappatore , Gilda Ruggieri
Background:
The recent advances in machine translation (MT) offer an appealing, low-cost solution for overcoming language barriers in multiple contexts (e.g., travelling, cultural interaction, digital content localisation). However, highly technical domains such as the healthcare sector, whose texts are typically long, complex, and specialised, pose multiple challenges to the effective and risk-safe use of MT.
Methods:
To examine how MT currently assists written and verbal health communication, and given the considerable heterogeneity in technological enablers, language pairs, user groups, training approaches, evaluation processes, and users' requirements, we propose a methodological multi-criteria literature review based on current guidelines in computer science research and grounded on a customised configuration of the PRISMA methodology, normally used to perform meta-analyses of clinical trials. The review focuses on language-to-language medical MT, covers the period January 2015–February 2023, and only considers articles written in English that are accessible via four scientific online digital libraries. Articles are ranked according to a meta-evaluation scoring method for MT scientific credibility along with a scoring for assessing the scope of MT in healthcare. Finally, a guideline for properly designing a study about MT in healthcare is also proposed.
Results:
The review included a final set of 58 articles from journals (n=30) and conference proceedings (n=28), considering 48 different language combinations. We identified a predominance of English-to-Spanish (n=19) and English-to-Chinese (n=16) implementations, mainly tailored to medical staff only (n=14) or along with patients (n=12). Included papers addressed clinical communication (n=21) and health education (n=37). Unidirectional real-time bilingual MT (n=24) was the most frequent configuration. MT implementations were dominated by Google Translate (n=22), often used as a baseline, OpenNMT (n=12), or Moses (n=11). Training and evaluation approaches varied considerably, while deployment and pre-/post-editing were rarely described.
DAE-NER: Dual-channel attention enhancement for Chinese named entity recognition
Pub Date : 2023-10-26 DOI: 10.1016/j.csl.2023.101581
Jingxin Liu , Mengzhe Sun , Wenhao Zhang , Gengquan Xie , Yongxia Jing , Xiulai Li , Zhaoxin Shi
Named Entity Recognition (NER) is an important component of Natural Language Processing (NLP) and a fundamental yet challenging task in text analysis. Recently, NER models for Chinese text have received considerable attention. Owing to the complexity and ambiguity of the Chinese language, the same semantic features carry different levels of importance in different contexts. However, the existing literature on Chinese Named Entity Recognition (CNER) does not capture this difference in importance. To tackle this problem, we propose a new method, referred to as Dual-channel Attention Enhancement for Chinese Named Entity Recognition (DAE-NER). Specifically, we design compression and decompression mechanisms to adapt Chinese characters to different contexts. By adjusting the weights of the semantic feature vector, the semantic weighting is reconstructed to alleviate the interference of contextual differences in semantics. Moreover, to enhance the semantic representation of different granularities in Chinese text, we design attention enhancement modules at the character and sentence levels, which dynamically learn the differences in semantic features to enhance important semantic representations in different dimensions. Extensive experiments on four benchmark datasets, namely MSRA, People Daily, Resume, and Weibo, demonstrate that the proposed DAE-NER can effectively improve the overall performance of CNER.
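The compression/decompression idea above can be illustrated with a squeeze-and-excitation-style channel gate over token features: a global context vector is compressed through a bottleneck, decompressed back, and used to reweight each semantic channel. This is only a minimal sketch of that general mechanism; the shapes, reduction ratio, and weight names are illustrative assumptions, not DAE-NER's actual architecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_gate(features, w_down, w_up):
    """Reweight semantic feature channels with a compress-then-decompress
    bottleneck (illustrative sketch; not the paper's exact module).

    features: (seq_len, d) token embeddings
    w_down:   (d, d // r) compression weights
    w_up:     (d // r, d) decompression weights
    """
    context = features.mean(axis=0)                          # squeeze: global context vector
    gate = sigmoid(np.maximum(context @ w_down, 0) @ w_up)   # excite: per-channel weights in (0, 1)
    return features * gate                                   # rescale each semantic channel

rng = np.random.default_rng(0)
d, r = 8, 2
x = rng.normal(size=(5, d))
out = channel_gate(x, rng.normal(size=(d, d // r)), rng.normal(size=(d // r, d)))
```

Adjusting the gate per context is what lets the same feature channel receive a different weight in different sentences.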
An effective approach for identifying keywords as high-quality filters to get emergency-implicated Twitter Spanish data
Pub Date : 2023-10-26 DOI: 10.1016/j.csl.2023.101579
Joel Garcia-Arteaga , Jesús Zambrano-Zambrano , Jorge Parraga-Alava , Jorge Rodas-Silva
Twitter has become a powerful knowledge source for data extraction in data mining projects owing to the amount of data generated by its users, which allows researchers to find content on almost any topic in real time. This, however, depends on the quality of the keywords used; otherwise, the extracted data will contain a high percentage of irrelevant content. In this paper, we introduce a time-aware machine-learning-based approach to identify meaningful keywords that maximize the extraction of relevant emergency-related tweets when the Twitter API is used. We follow the CRISP-DM methodology. The first stage relies on problem understanding, where we identified the need for meaningful keywords to filter content, extract higher-quality data, and reduce the percentage of irrelevant tweets. In the second stage, data collection, we used the official Twitter API to extract and label tweets as "emergencia" and "no emergencia". After that, we analyzed the collected data (data understanding) to determine preprocessing techniques and prepare the data for the model. Finally, in the modeling and testing stages, we trained a restricted Boltzmann machine and four variations of autoencoders, including an architecture proposed by a genetic algorithm, to use them as keyword identifiers and to determine which of them performs best for deployment to production (deployment stage). The results show a slightly better performance for the autoencoder proposed by the genetic algorithm (GADAE), achieving an R² score of 0.97, an MAE of 14×10⁻³, and an MSE of 4×10⁻⁴. GADAE, the best model, managed to extract 110% more relevant tweets than manual filtering in the context of emergency-implicated tweets in Ecuador.
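The models above are compared with three standard regression metrics (MAE, MSE, R²). For reference, these can be computed directly from predictions; this pure-Python sketch shows the definitions, with toy values rather than the paper's data.

```python
def regression_metrics(y_true, y_pred):
    """MAE, MSE, and R^2 from paired predictions (standard definitions;
    the paper's exact evaluation code is not given in the abstract)."""
    n = len(y_true)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    mean_t = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)             # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, mse, r2

mae, mse, r2 = regression_metrics([0.0, 0.5, 1.0, 1.5], [0.1, 0.5, 0.9, 1.5])
```

An R² near 1 with MAE and MSE near 0, as reported for GADAE, indicates predictions that track the targets almost exactly.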
A lightweight approach based on prompt for few-shot relation extraction
Pub Date : 2023-10-25 DOI: 10.1016/j.csl.2023.101580
Ying Zhang, Wencheng Huang, Depeng Dang
Few-shot relation extraction (FSRE) aims to predict the relation between two entities in a sentence using a few annotated samples. Many works solve the FSRE problem by training complex models with a huge number of parameters, which results in longer processing times to obtain results. Some recent works focus on introducing relation information into prototype networks in various ways. However, most of these methods obtain entity and relation representations by fine-tuning large pre-trained language models. This implies that a copy of the complete pre-trained model needs to be saved after fine-tuning for each specific task, leading to a shortage of computing and space resources. To address this problem, we introduce a lightweight approach that utilizes prompt-learning to assist in fine-tuning the model by adjusting fewer parameters. To obtain a better relation prototype, we design a new enhanced fusion module to fuse the relation information with the original prototype. We conduct extensive experiments on the common FSRE datasets FewRel 1.0 and FewRel 2.0 to verify the advantages of our method; the results show that our model achieves state-of-the-art performance.
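At the core of the prototype networks mentioned above, each relation class is represented by the mean of its support-set embeddings, and a query is assigned to the nearest prototype. The sketch below shows that basic mechanism with toy 2-D vectors; a real FSRE system would obtain the embeddings from a pre-trained encoder, and the paper's fusion module would further enrich the prototypes.

```python
import numpy as np

def prototype_predict(support, support_labels, query):
    """Nearest-prototype classification (generic prototypical-network step;
    embeddings here are toy vectors, not encoder outputs).

    support: (n, d) support-set embeddings
    support_labels: length-n class ids
    query: (d,) query embedding
    """
    classes = sorted(set(support_labels))
    protos = np.stack([
        support[[i for i, y in enumerate(support_labels) if y == c]].mean(axis=0)
        for c in classes
    ])
    dists = np.linalg.norm(protos - query, axis=1)  # Euclidean distance to each class prototype
    return classes[int(dists.argmin())]

support = np.array([[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [0.8, 1.0]])
labels = [0, 0, 1, 1]
pred = prototype_predict(support, labels, np.array([0.1, 0.1]))  # → 0
```

Because only the encoder produces the embeddings, improving the prototypes (e.g., by fusing relation information) does not require retraining the full model.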
The limits of the Mean Opinion Score for speech synthesis evaluation
Pub Date : 2023-10-21 DOI: 10.1016/j.csl.2023.101577
Sébastien Le Maguer , Simon King , Naomi Harte
The release of WaveNet and Tacotron has forever transformed the speech synthesis landscape. Thanks to these game-changing innovations, the quality of synthetic speech has reached unprecedented levels. However, to measure this leap in quality, an overwhelming majority of studies still rely on the Absolute Category Rating (ACR) protocol and compare systems using its output, the Mean Opinion Score (MOS). This protocol is not without controversy, and as current state-of-the-art synthesis systems produce outputs remarkably close to human speech, it is now vital to determine how reliable this score is.
To do so, we conducted a series of four experiments replicating and following the 2013 edition of the Blizzard Challenge. With these experiments, we asked four questions about the MOS: How stable is the MOS of a system across time? How do the scores of lower-quality systems influence the MOS of higher-quality systems? How does the introduction of modern technologies influence the scores of past systems? How does the MOS of modern technologies evolve in isolation?
The results of our experiments are manifold. Firstly, we verify the superiority of modern technologies over historical synthesis. Then, we show that despite its origin as an absolute category rating, the MOS is a relative score. While minimal variations are observed during the replication of the 2013-EH2 task, these variations can still lead to different conclusions for the intermediate systems. Our experiments also illustrate the sensitivity of the MOS to the presence or absence of lower and higher anchors. Overall, our experiments suggest that we may have reached the end of a cul-de-sac by evaluating only overall quality with the MOS. We must embark on a new road and develop different evaluation protocols better suited to the analysis of modern speech synthesis technologies.
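For context, the MOS under discussion is simply the arithmetic mean of 1–5 ACR ratings, conventionally reported with a confidence interval. The sketch below uses a normal-approximation 95% interval on toy ratings; the Blizzard Challenge analyses use more elaborate statistics, so this is only the baseline computation being critiqued.

```python
import statistics

def mos_with_ci(ratings):
    """Mean Opinion Score from 1-5 ACR ratings, with a normal-approximation
    95% confidence half-width (a common reporting convention)."""
    mean = statistics.fmean(ratings)
    sem = statistics.stdev(ratings) / len(ratings) ** 0.5  # standard error of the mean
    return mean, 1.96 * sem

mos, ci = mos_with_ci([4, 5, 4, 3, 4, 5, 4, 4])
```

The paper's point is that even a correctly computed MOS is only meaningful relative to the other systems and anchors in the same listening test, so comparing such numbers across experiments is unsafe.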
M-Sim: Multi-level Semantic Inference Model for Chinese short answer scoring in low-resource scenarios
Pub Date : 2023-10-20 DOI: 10.1016/j.csl.2023.101575
Peichao Lai, Feiyang Ye, Yanggeng Fu, Zhiwei Chen, Yingjie Wu, Yilei Wang
Short answer scoring is a significant task in natural language processing. On datasets comprising numerous explicit or implicit symbols and quantization entities, existing approaches continue to perform poorly. Additionally, the majority of relevant datasets contain only few-shot samples, reducing model efficacy in low-resource scenarios. To solve these issues, we propose a Multi-level Semantic Inference Model (M-Sim), which obtains features at multiple scales to fully consider the explicit or implicit entity information contained in the data. We then design prompt-based data augmentation to construct simulated datasets, which effectively enhances model performance in low-resource scenarios. Our M-Sim outperforms the best competitor models by an average of 1.48 percent in F1 score. The data augmentation significantly increases the performance of all approaches by an average of 0.036 in correlation coefficient scores.
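Prompt-based data augmentation of the kind mentioned above typically wraps each (question, answer) pair in several natural-language templates to multiply the training data. The templates and function below are hypothetical stand-ins, since the abstract does not specify the paper's actual prompts.

```python
def prompt_augment(question, answer, templates=None):
    """Build prompt-style training strings from a (question, answer) pair.
    The default templates are illustrative assumptions, not the paper's."""
    templates = templates or [
        "Question: {q} Answer: {a}",
        "{q} The student answered: {a}",
    ]
    return [t.format(q=question, a=answer) for t in templates]

samples = prompt_augment("What is H2O?", "Water")
```

Each original sample thus yields one simulated sample per template, which is how a few-shot dataset can be expanded without new annotation.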
Document-level relation extraction with entity mentions deep attention
Pub Date : 2023-10-20 DOI: 10.1016/j.csl.2023.101574
Yangsheng Xu, Jiaxin Tian, Mingwei Tang, Linping Tao, Liuxuan Wang
Document-level Relation Extraction (DocRE) aims to extract relations between entities from documents. In contrast to sentence-level relation extraction, it requires extracting semantic relations from multiple sentences, so DocRE algorithms have to deal with more complex entity structures and must unite semantic relationships across sentences when reasoning about relationships between entities. Existing algorithms fail to infer relationships between entities when dealing with complex entity structures. In this paper, we propose an entity-mentions deep attention framework that efficiently infers entity relationships through entity structure and contextual information. Firstly, a structural dependency module over entities is designed to achieve interaction between an entity's different mentions. Secondly, a deep contextual attention component is proposed to enrich the semantic information between entities with entity-related contexts. Finally, we use a distance mapping component to handle entity pairs that are far apart from each other. Our experimental results show that our model outperforms state-of-the-art models on three public datasets: DocRED, DGA, and CDR.
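A common building block behind mention-level attention of this kind is a dot-product attention that fuses an entity's scattered mention embeddings into one representation, weighted by their relevance to a context query. The sketch below shows that generic operation; the shapes and scoring are standard scaled dot-product attention, not the paper's exact components.

```python
import numpy as np

def mention_attention(mentions, context):
    """Fuse an entity's mention embeddings via scaled dot-product attention
    against a context vector (generic sketch, not the paper's architecture).

    mentions: (m, d) embeddings of the entity's mentions across the document
    context:  (d,)   entity-related context query vector
    """
    scores = mentions @ context / np.sqrt(mentions.shape[1])  # relevance of each mention
    weights = np.exp(scores - scores.max())                   # numerically stable softmax
    weights /= weights.sum()
    return weights @ mentions                                 # (d,) fused entity representation

m = np.array([[1.0, 0.0], [0.0, 1.0]])
rep = mention_attention(m, np.array([1.0, 0.0]))
```

Mentions that match the context receive higher weight, so the fused vector leans toward the mentions most informative for the relation being inferred.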
Meta adversarial learning improves low-resource speech recognition
Pub Date : 2023-10-19 DOI: 10.1016/j.csl.2023.101576
Yaqi Chen, Xukui Yang, Hao Zhang, Wenlin Zhang, Dan Qu, Cong Chen
Low-resource automatic speech recognition is a challenging task. To address this issue, multilingual meta-learning learns a better model initialization from many source languages, allowing for rapid adaptation to target languages. However, data scales and learning difficulties vary greatly from one language to another; as a result, the model favors large-scale and simple source languages. Moreover, the shared semantic space of the various languages is difficult to learn due to a lack of constraints on multilingual pre-training. In this paper, we propose a meta adversarial learning approach to address this problem. The meta-learner is guided to learn language-independent information through an adversarial auxiliary objective of language identification, which makes the shared semantic space more compact and improves model generalization. Additionally, we optimize adversarial training using the Wasserstein distance and temporal normalization, enabling more stable and simpler training. Experimental results on IARPA BABEL and OpenSLR show a significant performance improvement, outperforming state-of-the-art results by a large margin in all target languages, especially in few-shot settings. Finally, we use t-SNE visualization to demonstrate the superiority of our method.
Pub Date : 2023-10-12, DOI: 10.1016/j.csl.2023.101573
Dagmar Bittner , Claudia Frankenberg , Johannes Schröder
The present study aims at improving the predictive power of the use of pronouns in computational modeling of the risk of Alzheimer's dementia (AD) by (i) further determining the onset of increased pronoun use in AD and (ii) providing insights into the linguistic contexts affected by the increase early on. Pronoun use was compared longitudinally between subjects who either stayed cognitively intact (CTR-group, n = 5) or who had developed AD upon follow-up after 10–12 years (AD-group, n = 5). Data were taken from semi-structured biographical interviews, which stem from the Interdisciplinary Longitudinal Study on Adult Development and Aging (ILSE). The first interview (baseline) was conducted when all participants were still cognitively healthy. Analyses concerned the proportional distribution of 12 pronoun types and linguistic contexts of increased use. Already at baseline, the AD-group produced a significantly higher proportion of D-pronouns (der, die, das, etc.) than the CTR-group. The increase in D-pronouns did not affect linguistic contexts favoring the use of personal pronouns. Instead, we found a significantly higher proportion of D-pronouns referring to family members and a significantly higher proportion of personal pronouns referring to non-family humans in the AD-group than in the CTR-group. Our results suggest that the predictive power of the use of pronouns can be significantly improved in computational modeling of the risk of AD by assessing language material that induces the use of pronouns in linguistic contexts affected by the increase.
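The proportional-distribution analysis can be illustrated with a toy computation. The token list and the deliberately incomplete pronoun inventories below are our own illustrative assumptions, not the study's actual coding scheme of 12 pronoun types:

```python
from collections import Counter

# Hypothetical, partial inventories for illustration only.
D_PRONOUNS = {"der", "die", "das", "dem", "den", "dessen", "deren"}
PERSONAL_PRONOUNS = {"ich", "du", "er", "es", "wir", "ihr"}

def pronoun_proportions(tokens):
    """Proportion of each pronoun class relative to all tokens.

    Mirrors the per-type proportional distributions analysed in the study:
    counts are case-folded and divided by the utterance's token total.
    """
    counts = Counter(t.lower() for t in tokens)
    total = sum(counts.values())
    return {
        "d_pronoun": sum(counts[w] for w in D_PRONOUNS) / total,
        "personal": sum(counts[w] for w in PERSONAL_PRONOUNS) / total,
    }
```

Such per-interview proportions could then serve as longitudinal features in a risk model, which is the direction the abstract's conclusion points to.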
{"title":"Pronoun use in preclinical and early stages of Alzheimer's dementia","authors":"Dagmar Bittner , Claudia Frankenberg , Johannes Schröder","doi":"10.1016/j.csl.2023.101573","DOIUrl":"https://doi.org/10.1016/j.csl.2023.101573","url":null,"abstract":"<div><p>The present study aims at improving the predictive power of the use of pronouns in computational modeling of the risk of Alzheimer's dementia (AD) by (i) further determining the onset of increased pronoun use in AD and (ii) providing insights into the linguistic contexts affected by the increase early on. Pronoun use was compared longitudinally between subjects who either stayed cognitively intact (CTR-group, <em>n</em> = 5) or who had developed AD upon follow-up after 10–12 years (AD-group, <em>n</em> = 5). Data were taken from semi-structured biographical interviews, which stem from the Interdisciplinary Longitudinal Study on Adult Development and Aging (ILSE). The first interview (baseline) was conducted when all participants were still cognitively healthy. Analyses concerned the proportional distribution of 12 pronoun types and linguistic contexts of increased use. Already at baseline, the AD-group produced a significantly higher proportion of <em>D-pronouns</em> (<em>der, die, das</em>, etc.) than the CTR-group. The increase in <em>D-pronouns</em> did not affect linguistic contexts favoring the use of <em>personal pronouns</em>. Instead, we found a significantly higher proportion of <em>D-pronouns</em> referring to family members and a significantly higher proportion of <em>personal pronouns</em> referring to non-family <em>humans</em> in the AD-group than in the CTR-group. 
Our results suggest that the predictive power of the use of pronouns can be significantly improved in computational modeling of the risk of AD by assessing language material that induces the use of pronouns in linguistic contexts affected by the increase.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2023-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49836129","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-10-11, DOI: 10.1016/j.csl.2023.101572
Wei-Tyng Hong, Kuldeep Singh Rana
In recent years, Temporal Convolutional Networks (TCNs) have driven significant progress in single-channel noisy speech enhancement. However, TCN-based systems still face certain challenges, such as limited utilization of network channel depth for handling long-range dependencies and issues with weight sharing. To address these challenges, this paper proposes a novel channel-wise weighting scheme designed specifically for the sliced TCN framework. The scheme applies element-wise multiplication of shifted weights to each channel of a TCN slice: using a cyclic shift, these weights capture information from neighboring channels and uncover the dependencies between adjacent channels. By combining the channel-wise weighted TCN output and subsequently estimating a masking function, the proposed method effectively suppresses noise components, leading to enhanced speech quality. To train and evaluate the proposed method, we use speech datasets containing various noise types at different levels, and we optimize the end-to-end enhancement system with the Scale-Invariant Signal-to-Noise Ratio (SI-SNR) objective function. Experimental results demonstrate the effectiveness of the proposed TCN channel-wise weighting method, with a significant average improvement of approximately 9.8% in SI-SNR on the unseen noise dataset. This improvement was observed at an SNR of −3 dB for both the non-channel-wise and the proposed channel-wise weighting schemes within the multi-slicing TCN framework. The main advantage of the proposed approach is its ability to address uneven and biased output from TCN slices, particularly when dealing with highly non-stationary noisy speech infused with speech-like noise, leading to more robust performance in various real-world applications.
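The SI-SNR objective used for training is a standard metric and can be sketched as follows; the cyclic channel weighting itself depends on implementation details the abstract does not specify, so only the loss is shown, and the `eps` guard is our own addition:

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-Invariant Signal-to-Noise Ratio in dB.

    Projects the zero-mean estimate onto the zero-mean reference to obtain
    the target component, then compares its energy with the residual noise.
    """
    est = est - est.mean()
    ref = ref - ref.mean()
    s_target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    e_noise = est - s_target
    return float(10 * np.log10((np.sum(s_target ** 2) + eps)
                               / (np.sum(e_noise ** 2) + eps)))
```

Training maximizes SI-SNR (i.e., minimizes its negative); because both the target and noise components scale together, simply amplifying the enhancer's output cannot inflate the score.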
{"title":"A ChannelWise weighting technique of slice-based Temporal Convolutional Network for noisy speech enhancement","authors":"Wei-Tyng Hong, Kuldeep Singh Rana","doi":"10.1016/j.csl.2023.101572","DOIUrl":"https://doi.org/10.1016/j.csl.2023.101572","url":null,"abstract":"<div><p><span>In recent years, Temporal Convolutional Networks<span> (TCNs) have driven significant progress in single-channel noisy speech enhancement. However, TCN-based systems still face certain challenges, such as limited utilization of network channel depth for handling long-range dependencies and issues with weight sharing. To address these challenges, this paper proposes a novel channel-wise weighting scheme, specifically designed for the sliced TCN framework. The proposed scheme involves the element-wise multiplication of shifting weight techniques for each channel of the TCN slice. Utilizing a cyclically shifted approach, these weights capture information from neighboring channels, uncovering the dependencies between adjacent channels. By combining the channel-wise weighted TCN output and subsequently estimating a masking function, the proposed method effectively suppresses noise components, leading to enhanced speech quality. To train and evaluate our proposed method, we utilize speech datasets that consist of various noise types at different levels. To optimize the performance of the proposed end-to-end enhancement system, we adopt the Scale-Invariant Signal-to-Noise Ratio (SI-SNR) objective function. Experimental results demonstrate the effectiveness of our proposed TCN channel-wise weighting method, with a significant average improvement of approximately 9.8% in SI-SNR for the unseen noise dataset. This improvement was observed at an SNR of </span></span><span><math><mo>−</mo></math></span>3 dB for both non-channel-wise weighting schemes and the proposed channel-wise weighting schemes within the Multi-slicing TCNs framework. 
The main advantage of the proposed approach is its ability to address the challenges of uneven and biased output from TCN slices, particularly when dealing with highly non-stationary, noisy speech signals infused with speech-like noise. This leads to more robust performance in various real-world applications.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2023-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49844693","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}