Pub Date : 2024-04-06 | DOI: 10.1016/j.csl.2024.101647
Qian Wang, Weiqi Zhang, Tianyi Lei, Yu Cao, Dezhong Peng, Xu Wang
Sentence embedding, which aims to learn an effective representation of a sentence, is an important component of many downstream tasks. Recently, most sentence-embedding methods have combined contrastive learning with pre-trained models and achieved encouraging results. However, on the one hand, these methods rely on discrete data augmentation to obtain the positive samples for contrastive learning, which can distort the original semantics of a sentence. On the other hand, most methods directly adopt contrastive frameworks from computer vision, which can constrain contrastive training because text data are discrete and sparse compared with image data. To address these issues, we design SEBGM, a novel contrastive framework based on a generation model with multi-task learning, which performs supervised contrastive training on a natural language inference (NLI) dataset to obtain meaningful sentence embeddings. SEBGM uses multi-task learning to better exploit word-level and sentence-level semantic information, so its positive samples come from NLI rather than from data augmentation. Extensive experiments show that SEBGM advances the state of the art in sentence embedding on semantic textual similarity (STS) tasks.
Title: "SEBGM: Sentence Embedding Based on Generation Model with multi-task learning" — Computer Speech and Language (IF 4.3), Journal Article, published 2024-04-06. DOI: 10.1016/j.csl.2024.101647.
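The core idea of taking positive pairs from NLI entailment rather than from data augmentation can be sketched with a plain InfoNCE-style supervised contrastive loss. This is a minimal NumPy illustration, not the authors' SEBGM implementation; the function name, the in-batch-negatives setup, and the temperature value are all illustrative assumptions:

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.05):
    """InfoNCE with in-batch negatives: row i of `anchors` (a premise embedding)
    is pulled toward row i of `positives` (its NLI entailment hypothesis) and
    pushed away from every other row in the batch."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    sim = a @ p.T / temperature                      # (batch, batch) scaled cosine sims
    sim = sim - sim.max(axis=1, keepdims=True)       # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(log_prob)))        # cross-entropy against the diagonal
```

Aligned anchor/positive rows should give a much lower loss than mismatched ones, which is what drives the embeddings of entailment pairs together during training.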
Pub Date : 2024-04-04 | DOI: 10.1016/j.csl.2024.101646
Ting Hu, Christoph Meinel, Haojin Yang
Fine-tuning and inference on large language models such as BERT have become increasingly expensive in terms of memory and computation. Recently proposed computation-flexible BERT models facilitate deployment in varied computational environments. Training such flexible BERT models involves jointly optimizing multiple BERT subnets, which unavoidably interfere with one another. Moreover, the performance of the larger subnets is limited by the gap between the smallest subnet and the supernet, despite efforts to enhance the smaller subnets. We therefore propose layer-wise neural grafting to boost BERT subnets, especially the larger ones. The proposed method improves the average performance of the subnets on six GLUE tasks and boosts the supernet on all GLUE tasks and the SQuAD dataset. Based on the boosted subnets, we further build an inference framework enabling practical width- and depth-dynamic inference per input, combining width-dynamic gating modules with early-exit off-ramps in the depth dimension. Experimental results show that the proposed framework achieves a better dynamic inference range than other methods when trading off performance against computational complexity on four GLUE tasks and SQuAD. In particular, our best-tradeoff inference result outperforms other fixed-size models with a similar amount of computation. Compared to BERT-Base, the proposed inference framework yields a 1.3-point improvement in the average GLUE score and a 2.2-point increase in the F1 score on SQuAD, while reducing computation by around 45%.
Title: "A flexible BERT model enabling width- and depth-dynamic inference" — Computer Speech and Language (IF 4.3), Journal Article, published 2024-04-04. DOI: 10.1016/j.csl.2024.101646. Open access PDF: https://www.sciencedirect.com/science/article/pii/S0885230824000299/pdfft?md5=bb08debe9a20bb5be9d04abbaca1b345&pid=1-s2.0-S0885230824000299-main.pdf
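The depth-dynamic part of such a framework can be sketched as an early-exit loop: after each layer, an off-ramp classifier scores the current hidden state, and inference stops as soon as the prediction is confident enough. This is a toy NumPy sketch of the control flow only (the matrix "layers" stand in for transformer blocks; the threshold rule and function names are assumptions, not the paper's exact gating):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def early_exit_forward(x, layers, off_ramps, threshold=0.9):
    """Depth-dynamic inference sketch: after each layer, an off-ramp classifier
    produces a class distribution; exit as soon as its max probability clears
    the confidence threshold. Returns (prediction, layers actually executed)."""
    h = x
    for depth, (layer, ramp) in enumerate(zip(layers, off_ramps), start=1):
        h = np.tanh(layer @ h)               # stand-in for a transformer layer
        probs = softmax(ramp @ h)            # off-ramp prediction at this depth
        if probs.max() >= threshold or depth == len(layers):
            return int(probs.argmax()), depth
```

Easy inputs exit early and save computation, while hard inputs traverse the full depth; width-dynamic gating would additionally shrink each layer's matrices.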
Currently, most retrieval-based question-answering systems depend on supervised training with question pairs, and how unsupervised methods can improve their accuracy remains underexplored. From the perspective of question topic keywords, this paper presents TFCSG, an unsupervised question-retrieval approach based on topic-keyword filtering and multi-task learning. First, we design a topic-keyword filtering algorithm which, unlike a topic model, can sequentially filter out the keywords of a question and provide a training corpus for subsequent unsupervised learning. Then, three tasks are designed to train the question-retrieval model: a question contrastive-learning task based on a topic-keyword repetition strategy; a similarity-distribution task between questions and their corresponding sequential topic keywords; and a task that generates the sequential topic keywords from the question. These three tasks are trained in parallel to obtain high-quality question representations and thus improve question-retrieval accuracy. Finally, experimental results on four publicly available datasets demonstrate the effectiveness of TFCSG, with average improvements of 7.1%, 4.4%, and 3.5% in the P@1, MAP, and MRR metrics when using the BERT model compared to the baseline, and of 5.7%, 3.5%, and 3.0% when using the RoBERTa model. The accuracy of the unsupervised similar-question retrieval task is effectively improved; in particular, the values of P@1, P@5, and P@10 are close, indicating that the retrieved similar questions are ranked near the top.
Title: "Unsupervised question-retrieval approach based on topic keywords filtering and multi-task learning" — Aiguo Shang, Xinjuan Zhu, Michael Danner, Matthias Rätsch. Computer Speech and Language (IF 4.3), Journal Article, published 2024-03-21. DOI: 10.1016/j.csl.2024.101644.
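A minimal stand-in for "sequentially filtering out the keywords of a question" is to score tokens against a reference corpus and return the survivors in question order. The sketch below uses plain TF-IDF scoring purely for illustration; the paper's actual filtering algorithm is not specified here, and the function name and smoothing are assumptions:

```python
import math
from collections import Counter

def sequential_topic_keywords(question, corpus, k=3):
    """Toy topic-keyword filter: score each token of `question` by smoothed
    TF-IDF against `corpus`, keep the top-k, and return them in the order
    they appear in the question (the 'sequential' part)."""
    tokens = question.lower().split()
    docs = [doc.lower().split() for doc in corpus]
    tf = Counter(tokens)

    def tfidf(tok):
        df = sum(tok in doc for doc in docs)  # document frequency in the corpus
        return tf[tok] / len(tokens) * math.log((1 + len(docs)) / (1 + df))

    top = set(sorted(set(tokens), key=tfidf, reverse=True)[:k])
    kept, seen = [], set()
    for tok in tokens:                        # preserve question order, dedupe
        if tok in top and tok not in seen:
            kept.append(tok)
            seen.add(tok)
    return kept
```

The ordered keyword list can then serve as the target for the keyword-generation task and as the basis for the repetition and similarity-distribution tasks.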
Pub Date : 2024-03-18 | DOI: 10.1016/j.csl.2024.101643
Zhengwei Zhai, Rongli Fan, Jie Huang, Neal Xiong, Lijuan Zhang, Jian Wan, Lei Zhang
Relational triple extraction is a critical step in knowledge graph construction. Compared to pipeline-based extraction, joint extraction is gaining more attention because it can better utilize entity and relation information without causing error propagation. Yet the challenge in joint extraction lies in handling overlapping triples. Existing approaches adopt sequential steps or multiple modules, which often accumulate errors and suffer interference from redundant data. In this study, we propose an innovative joint extraction model with a cross-attention mechanism and global pointers with a context shield window. Specifically, our method begins by feeding text into a pre-trained RoBERTa model to generate word-vector representations. These embeddings are then passed through a modified cross-attention layer, together with entity-type embeddings, to compensate for missing entity-type information. Next, we employ the global pointer to transform the extraction problem into a quintuple extraction problem, which neatly resolves overlapping triples. Notably, we design a context shield window on the global pointer, which restricts entity identification to a limited range during extraction. Finally, the model's robustness against malicious samples is improved by adding adversarial training during the training process. Our approach outperforms mainstream models, achieving strong results on three publicly available datasets.
Title: "A novel joint extraction model based on cross-attention mechanism and global pointer using context shield window" — Computer Speech and Language (IF 4.3), Journal Article, published 2024-03-18. DOI: 10.1016/j.csl.2024.101643. Open access PDF: https://www.sciencedirect.com/science/article/pii/S0885230824000263/pdfft?md5=6db6d29053e0503fc07e8e1ded002d0e&pid=1-s2.0-S0885230824000263-main.pdf
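A global pointer scores every (start, end) token pair as a candidate entity span, and a context shield window can be realized as a mask that discards spans wider than a fixed limit. The following NumPy sketch shows only that masking idea under stated assumptions (the window semantics and function names are illustrative, not the paper's exact formulation):

```python
import numpy as np

def shield_window_mask(seq_len, window):
    """Global-pointer-style span mask with a context shield window: a span
    (start, end) is admissible only if start <= end and its width end - start
    stays inside the window, so entities are searched within a limited range."""
    idx = np.arange(seq_len)
    width = idx[None, :] - idx[:, None]        # end - start for every pair
    return (width >= 0) & (width < window)

def best_span(scores, window):
    """Pick the highest-scoring admissible (start, end) span."""
    masked = np.where(shield_window_mask(len(scores), window), scores, -np.inf)
    start, end = np.unravel_index(np.argmax(masked), masked.shape)
    return int(start), int(end)
```

Shrinking the window filters out implausibly long spans before decoding, which is the mechanism the shield window uses to keep entity identification local.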
Orthogonal linear mapping is a commonly used approach for generating cross-lingual embeddings between two monolingual corpora, using a word-frequency-based seed-dictionary alignment. While this approach is effective for isomorphic language pairs, it does not perform well for distant language pairs with different sentence structures and morphological properties. For a distant language pair, existing frequency-aligned orthogonal mapping methods suffer from two problems: (i) the frequencies of the source and target words are not comparable, and (ii) different word pairs in the seed dictionary may contribute differently. Motivated by these two concerns, this paper proposes a novel centrality-aligned, ridge-regression-based orthogonal mapping. The proposed method uses centrality-based alignment for seed-dictionary selection and a ridge-regression framework to incorporate influence weights for the different word pairs in the seed dictionary. Experimental observations over five language pairs (both isomorphic and distant) show that the proposed method outperforms baseline methods on the Bilingual Dictionary Induction (BDI) task, the Sentence Retrieval Task (SRT), and machine translation. Several further analyses are also included to support the proposed method.
Title: "Improving linear orthogonal mapping based cross-lingual representation using ridge regression and graph centrality" — Deepen Naorem, Sanasam Ranbir Singh, Priyankoo Sarmah. Computer Speech and Language (IF 4.3), Journal Article, published 2024-03-16. DOI: 10.1016/j.csl.2024.101640.
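Incorporating per-pair influence weights into a ridge-regression mapping has a closed form. As a minimal sketch (the weights here are arbitrary placeholders for centrality-derived influence; this is not the authors' full pipeline), a weighted ridge objective over seed pairs (x_i, y_i) with weights w_i, min_W sum_i w_i ||W x_i - y_i||^2 + lam ||W||_F^2, is solved by W = (Y^T D X)(X^T D X + lam I)^{-1} with D = diag(w):

```python
import numpy as np

def weighted_ridge_mapping(X, Y, weights, lam=0.1):
    """Closed-form weighted ridge regression mapping source embeddings (rows of
    X) onto target embeddings (rows of Y), where seed-dictionary pair i carries
    influence weight weights[i]:
        W = (Y^T D X)(X^T D X + lam I)^{-1},  D = diag(weights)."""
    D = np.diag(np.asarray(weights, dtype=float))
    d = X.shape[1]
    return Y.T @ D @ X @ np.linalg.inv(X.T @ D @ X + lam * np.eye(d))
```

With uniform weights and negligible regularization this reduces to ordinary least squares, so a mapping that generated the data exactly is recovered; larger weights let trusted (high-centrality) seed pairs dominate the fit.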
Pub Date : 2024-03-16 | DOI: 10.1016/j.csl.2024.101641
Fan Yang, Muqiao Yang, Xiang Li, Yuxuan Wu, Zhiyuan Zhao, Bhiksha Raj, Rita Singh
Reinforcement learning (RL) has proven effective in improving model performance and robustness for automatic speech recognition (ASR). Researchers have employed RL-based training strategies to push performance beyond conventional supervised or semi-supervised learning. However, existing approaches treat RL as a supplementary tool, leaving much of its potential unexplored. In this paper, we formulate a novel pure-RL setting in which an ASR model is trained exclusively through RL from human feedback metrics, e.g., word error rate (WER) or a binary reward. This approach promises to significantly simplify annotation if the conventional, onerous transcription could be replaced with a single numeric value supplied in a human–computer interaction (HCI) manner. Our experiments demonstrate the feasibility of this new setting and also identify two main inherent issues in conventional RL-based ASR training that may lead to performance degradation: (1) the mismatch between action and reward has commonly been overlooked in Connectionist Temporal Classification (CTC) based models, which is attributable to the inherent CTC alignment-mapping issue; and (2) the classic exploration–exploitation trade-off still arises in the sampling stage of RL-based ASR, and striking a balance between the two is a challenge. To address these issues, we first propose a new RL-based approach named CTC-aligned Policy Gradient (CTC-PG), which provides a unified formulation for different sampling strategies and alleviates the action–reward mismatch in CTC-based models. Moreover, we propose Focal Sampling to balance exploration and exploitation with a flexible temperature parameter. Experimental results on the LibriSpeech dataset showcase the effectiveness and robustness of our methods, harnessing the full potential of RL in training ASR models.
Title: "A closer look at reinforcement learning-based automatic speech recognition" — Computer Speech and Language (IF 4.3), Journal Article, published 2024-03-16. DOI: 10.1016/j.csl.2024.101641.
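The basic loop of RL from a WER-style scalar reward can be illustrated with two standard ingredients: a word error rate computed via edit distance, and a REINFORCE gradient that scales the log-probability gradient of a sampled token by the (reward minus baseline) advantage. This is a generic textbook sketch, not the paper's CTC-PG or Focal Sampling; the function names are illustrative:

```python
import numpy as np

def wer(ref, hyp):
    """Word error rate: Levenshtein distance over word sequences / len(ref)."""
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)
    d[0, :] = np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1, j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)
    return d[-1, -1] / max(len(ref), 1)

def reinforce_grad(logits, sampled, reward, baseline=0.0):
    """REINFORCE gradient w.r.t. the logits for one sampled token:
    (reward - baseline) * d/dlogits log softmax(logits)[sampled]
      = advantage * (onehot(sampled) - softmax(logits))."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    onehot = np.eye(len(logits))[sampled]
    return (reward - baseline) * (onehot - probs)
```

In an ASR setting the reward would typically be -wer(reference, hypothesis), so hypotheses with fewer word errors have their sampled tokens reinforced.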
Pub Date : 2024-03-16 | DOI: 10.1016/j.csl.2024.101639
T. Lavanya, P. Vijayalakshmi, K. Mrinalini, T. Nagarajan
Higher order statistics (HOS) can be effectively employed for noise suppression, provided the noise follows a Gaussian distribution. Since most noises are normally distributed, HOS can be used effectively for speech enhancement in noisy environments. In the current work, HOS-based parametric modelling for magnitude-spectrum estimation is proposed to improve the SNR under noisy conditions. To this end, a non-Gaussian reduced ARMA model formulated using third-order cumulant sequences (Giannakis, 1990) is used. Here, the AR and MA model orders, p and q, are dynamically estimated by a well-established periodicity-estimation technique for noisy conditions, namely the Ramanujan Filter Bank (RFB) approach. The AR coefficients estimated from the reduced ARMA model are used to obtain a partially enhanced speech output, whose magnitude spectrum is then subjected to second-level enhancement using log MMSE with a modified speech presence uncertainty (SPU) estimation technique. The refined magnitude spectrum is combined with the phase spectrum extracted using the proposed bicoherence-based phase compensation (BPC) technique to estimate the enhanced speech output. The HOS-driven speech enhancement technique proposed in the current work is observed to be efficient for white, pink, babble and buccaneer noises. The objective measures, PESQ and STOI, indicate that the proposed method works well under all the noise conditions considered for evaluation.
Title: "Higher order statistics-driven magnitude and phase spectrum estimation for speech enhancement" — Computer Speech and Language (IF 4.3), Journal Article, published 2024-03-16. DOI: 10.1016/j.csl.2024.101639.
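The reason third-order cumulants suppress Gaussian noise is that every third-order cumulant of a Gaussian process is zero, while non-Gaussian signals (such as speech) retain non-zero values. A minimal sample estimator, assuming non-negative lags and a long zero-mean signal (a generic illustration, not the paper's full cumulant-sequence pipeline):

```python
import numpy as np

def third_order_cumulant(x, tau1, tau2):
    """Sample third-order cumulant c3(tau1, tau2) = E[x(n) x(n+tau1) x(n+tau2)]
    of a zero-mean signal (tau1, tau2 >= 0 assumed). It vanishes for Gaussian
    noise, which is why cumulant-based ARMA fitting suppresses the Gaussian
    component of a noisy observation."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                       # enforce zero mean
    n = len(x) - max(tau1, tau2, 0)        # usable overlap length
    return float(np.mean(x[:n] * x[tau1:n + tau1] * x[tau2:n + tau2]))
```

At lag (0, 0) this is the third central moment (skewness numerator): near zero for Gaussian samples, clearly non-zero for a skewed signal, so ARMA coefficients fitted from cumulants see mainly the non-Gaussian part.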
Pub Date : 2024-03-16 | DOI: 10.1016/j.csl.2024.101642
Fulian Yin, Tongtong Xing, Meiqi Ji, Zebin Yao, Ruiling Fu, Yuewei Wu
With the explosion of information and users' changing interests, sequential program recommendation is becoming increasingly important for TV program platforms to help their users find interesting programs. Existing sequential recommendation methods mainly model user preferences directly from users' historical interaction behaviors, learning too little about the dynamics of programs and users while ignoring the rich semantic information in the heterogeneous graph. To address these issues, we propose multipath-guided heterogeneous graph neural networks for TV program sequential recommendation (MHG-PSR), which enhance the representations of programs and users through multiple paths in heterogeneous graphs. In our method, auxiliary information is fused to supplement the semantics of programs and users and obtain initial representations. We then explore the interactive behaviors of programs and users with temporal and auxiliary information to model the collaborative signals in the heterogeneous graph and extract users' dynamic preferences over programs. Extensive experiments on real-world datasets verify that the proposed method effectively improves the performance of TV program sequential recommendation.
{"title":"Multipath-guided heterogeneous graph neural networks for sequential recommendation","authors":"Fulian Yin , Tongtong Xing , Meiqi Ji , Zebin Yao , Ruiling Fu , Yuewei Wu","doi":"10.1016/j.csl.2024.101642","DOIUrl":"10.1016/j.csl.2024.101642","url":null,"abstract":"<div><p>With the explosion of information and users’ changing interests, program sequential recommendation becomes increasingly important for TV program platforms to help their users find interesting programs. Existing sequential recommendation methods mainly focus on modeling user preferences directly from users’ historical interaction behaviors, learning too little about the dynamics of programs and users while ignoring the rich semantic information in the heterogeneous graph. To address these issues, we propose multipath-guided heterogeneous graph neural networks for TV program sequential recommendation (MHG-PSR), which enhance the representations of programs and users through multiple paths in heterogeneous graphs. In our method, auxiliary information is fused to supplement the semantics of programs and users and to obtain initial representations. Then, we explore the interactive behaviors of programs and users with temporal and auxiliary information to model the collaborative signals in the heterogeneous graph and extract users’ dynamic preferences for programs. 
Extensive experiments on real-world datasets verify that the proposed method effectively improves the performance of TV program sequential recommendation.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140181846","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
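The core idea in MHG-PSR, per the abstract, is enhancing user and program representations through multiple paths in a heterogeneous graph. As a deliberately minimal sketch (not the paper's model; the mean aggregation, function name, and variable names are all assumptions), one such path, user → program, could be aggregated like this:

```python
import numpy as np

def metapath_aggregate(user_prog, prog_emb):
    """Aggregate program embeddings along the user -> program path of a
    heterogeneous interaction graph: each user's representation is the mean
    of the embeddings of the programs that user interacted with.
    user_prog: (U, P) binary interaction matrix; prog_emb: (P, d) embeddings."""
    counts = user_prog.sum(axis=1, keepdims=True)
    counts[counts == 0] = 1                      # avoid divide-by-zero for users with no history
    return (user_prog @ prog_emb) / counts       # (U, d) mean-pooled user representations
```

A full multipath model would compute several such path-specific representations (e.g. user → program → user) and combine them, typically with learned attention weights rather than a plain mean.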
Pub Date : 2024-03-12DOI: 10.1016/j.csl.2024.101638
Kexin Jiang , Guozhe Jin , Zhenguo Zhang , Rongyi Cui , Yahui Zhao
Text matching is a computational task that involves comparing and establishing the semantic relationship between two textual inputs. The prevailing approach in text matching entails computing textual representations or employing attention mechanisms to facilitate interaction between the texts. These techniques have demonstrated notable efficacy in various text-matching scenarios. However, these methods primarily focus on modeling the sentence pairs themselves and rarely incorporate additional information to enrich the models. In this study, we address the challenge of text matching in natural language processing by proposing a novel approach that leverages external knowledge sources, namely Wiktionary for word definitions and a knowledge graph for text triplet information. Unlike conventional methods that primarily rely on textual representations and attention mechanisms, our approach enhances semantic understanding by integrating relevant external information. We introduce a fusion module to amalgamate the semantic insights derived from the text and the external knowledge. Our methodology’s efficacy is evidenced through comprehensive experiments conducted on diverse datasets, encompassing natural language inference, text classification, and medical natural language inference. The results unequivocally indicate a significant enhancement in model performance, underscoring the effectiveness of incorporating external knowledge into text-matching tasks.
{"title":"Incorporating external knowledge for text matching model","authors":"Kexin Jiang , Guozhe Jin , Zhenguo Zhang , Rongyi Cui , Yahui Zhao","doi":"10.1016/j.csl.2024.101638","DOIUrl":"10.1016/j.csl.2024.101638","url":null,"abstract":"<div><p>Text matching is a computational task that involves comparing and establishing the semantic relationship between two textual inputs. The prevailing approach in text matching entails computing textual representations or employing attention mechanisms to facilitate interaction between the texts. These techniques have demonstrated notable efficacy in various text-matching scenarios. However, these methods primarily focus on modeling the sentence pairs themselves and rarely incorporate additional information to enrich the models. In this study, we address the challenge of text matching in natural language processing by proposing a novel approach that leverages external knowledge sources, namely Wiktionary for word definitions and a knowledge graph for text triplet information. Unlike conventional methods that primarily rely on textual representations and attention mechanisms, our approach enhances semantic understanding by integrating relevant external information. We introduce a fusion module to amalgamate the semantic insights derived from the text and the external knowledge. Our methodology’s efficacy is evidenced through comprehensive experiments conducted on diverse datasets, encompassing natural language inference, text classification, and medical natural language inference. 
The results unequivocally indicate a significant enhancement in model performance, underscoring the effectiveness of incorporating external knowledge into text-matching tasks.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140124573","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
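The fusion module above is described only at a high level, so the following is a hypothetical stand-in rather than the authors' design: a scalar sigmoid gate that mixes the text representation with the external-knowledge vector (the gate parameterization and all names are assumptions):

```python
import numpy as np

def gated_fusion(text_vec, knowledge_vec, gate_w):
    """Fuse a sentence representation with an external-knowledge vector via a
    scalar sigmoid gate computed from their concatenation. gate_w is a
    (learned) weight vector of length len(text_vec) + len(knowledge_vec)."""
    score = gate_w @ np.concatenate([text_vec, knowledge_vec])
    g = 1.0 / (1.0 + np.exp(-score))             # sigmoid gate in (0, 1)
    return g * text_vec + (1.0 - g) * knowledge_vec
```

With a zero gate weight the sigmoid yields 0.5 and the fusion reduces to a plain average; training would move the gate toward whichever source is more reliable for a given input.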
Pub Date : 2024-03-07DOI: 10.1016/j.csl.2024.101637
Hongru Wang , Wai-Chung Kwan , Min Li , Zimo Zhou , Kam-Fai Wong
To alleviate the shortage of dialogue datasets for Cantonese, a low-resource language, and to facilitate the development of customized task-oriented dialogue systems, we propose KddRES, the first Cantonese Knowledge-driven dialogue dataset for REStaurants. It contains 834 multi-turn dialogues, 8000 utterances, and 26 distinct slots. The slots are hierarchical: beneath the 26 coarse-grained slots are an additional 16 fine-grained slots. Annotations of dialogue states and dialogue actions on both the user and system sides are provided to suit multiple downstream tasks such as natural language understanding and dialogue state tracking. To effectively detect hierarchical slots, we propose a framework, HierBERT, which models label semantics and the relationships between different slots. Experimental results demonstrate that KddRES is more challenging than existing datasets due to the introduction of hierarchical slots, and that our framework is particularly effective in detecting secondary slots, achieving a new state-of-the-art. Given the rich annotation and hierarchical slot structure of KddRES, we hope it will promote research on the development of customized dialogue systems in Cantonese and other conversational AI tasks, such as dialogue state tracking and policy learning.
{"title":"KddRES: A Multi-level Knowledge-driven Dialogue Dataset for Restaurant Towards Customized Dialogue System","authors":"Hongru Wang , Wai-Chung Kwan , Min Li , Zimo Zhou , Kam-Fai Wong","doi":"10.1016/j.csl.2024.101637","DOIUrl":"https://doi.org/10.1016/j.csl.2024.101637","url":null,"abstract":"<div><p>To alleviate the shortage of dialogue datasets for Cantonese, a low-resource language, and to facilitate the development of customized task-oriented dialogue systems, we propose <strong>KddRES</strong>, the first Cantonese <strong>K</strong>nowledge-driven <strong>d</strong>ialogue <strong>d</strong>ataset for <strong>RES</strong>taurants. It contains 834 multi-turn dialogues, 8000 utterances, and 26 distinct slots. The slots are hierarchical: beneath the 26 coarse-grained slots are an additional 16 fine-grained slots. Annotations of dialogue states and dialogue actions on both the user and system sides are provided to suit multiple downstream tasks such as natural language understanding and dialogue state tracking. To effectively detect hierarchical slots, we propose a framework, HierBERT, which models label semantics and the relationships between different slots. Experimental results demonstrate that KddRES is more challenging than existing datasets due to the introduction of hierarchical slots, and that our framework is particularly effective in detecting secondary slots, achieving a new state-of-the-art. 
Given the rich annotation and hierarchical slot structure of KddRES, we hope it will promote research on the development of customized dialogue systems in Cantonese and other conversational AI tasks, such as dialogue state tracking and policy learning.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140095934","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
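Hierarchical slots, as described in this abstract, pair each fine-grained slot with a coarse-grained parent. As a toy illustration (the slot names below are invented for the example, not taken from KddRES), lifting fine-grained detections to their parent slots might look like:

```python
# Hypothetical parent map: each fine-grained slot rolls up to one coarse-grained slot.
PARENT = {
    "dish-price": "dish",
    "dish-spiciness": "dish",
    "address-floor": "address",
}

def lift_to_coarse(fine_slots):
    """Infer the active coarse-grained slots from detected fine-grained slots:
    a coarse slot is considered active if any of its children was detected."""
    return {PARENT[s] for s in fine_slots if s in PARENT}
```

A hierarchy-aware detector can exploit this structure in the other direction too, conditioning fine-grained predictions on which coarse slots fired.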