Transactions of the Association for Computational Linguistics: Latest Articles

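Learning More from Mixed Emotions: A Label Refinement Method for Emotion Recognition in Conversations

IF 10.9 · Zone 1, Computer Science · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE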

Transactions of the Association for Computational Linguistics

Pub Date : 2023-12-01 DOI: 10.1162/tacl_a_00614

Jintao Wen, Geng Tu, Rui Li, Dazhi Jiang, Wenhua Zhu

Abstract One-hot labels are commonly employed as ground truth in Emotion Recognition in Conversations (ERC). However, this approach may not fully encompass all the emotions conveyed in a single utterance, leading to suboptimal performance. Regrettably, current ERC datasets lack comprehensive emotionally distributed labels. To address this issue, we propose the Emotion Label Refinement (EmoLR) method, which utilizes context- and speaker-sensitive information to infer mixed emotional labels. EmoLR comprises an Emotion Predictor (EP) module and a Label Refinement (LR) module. The EP module recognizes emotions and provides context/speaker states for the LR module. Subsequently, the LR module calculates the similarity between these states and ground-truth labels, generating a refined label distribution (RLD). The RLD captures a more comprehensive range of emotions than the original one-hot labels. These refined labels are then used for model training in place of the one-hot labels. Experimental results on three public conversational datasets demonstrate that our EmoLR achieves state-of-the-art performance.
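
The label-refinement idea in this abstract can be illustrated with a small sketch. The abstract only says that the LR module computes a similarity between context/speaker states and ground-truth labels to produce a refined label distribution; the cosine similarity, the softmax, the label embeddings, and the mixing weight `alpha` below are all assumptions made for illustration and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def refine_label(state, label_embeddings, gold_index, alpha=0.7):
    """Blend a one-hot label with a similarity-based distribution.

    state            -- hypothetical context/speaker state vector
    label_embeddings -- hypothetical one vector per emotion label
    gold_index       -- index of the annotated (one-hot) emotion
    alpha            -- assumed weight kept on the one-hot label
    """
    sims = [cosine(state, emb) for emb in label_embeddings]
    soft = softmax(sims)
    return [
        alpha * (1.0 if i == gold_index else 0.0) + (1.0 - alpha) * p
        for i, p in enumerate(soft)
    ]


# Three emotions; the state is closest to label 0 and partly close to label 2.
rld = refine_label([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]], gold_index=0)
print(rld)
```

With `alpha` above 0.5 the annotated emotion always keeps the largest mass, while the remaining mass is spread over the other emotions according to their similarity to the state.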

Reference: Jintao Wen, Geng Tu, Rui Li, Dazhi Jiang, Wenhua Zhu. Learning More from Mixed Emotions: A Label Refinement Method for Emotion Recognition in Conversations. Transactions of the Association for Computational Linguistics, pp. 1485-1499, 2023-12-01. https://doi.org/10.1162/tacl_a_00614

Cited by: 0

MissModal: Increasing Robustness to Missing Modality in Multimodal Sentiment Analysis

Pub Date : 2023-12-01 DOI: 10.1162/tacl_a_00628

Ronghao Lin, Haifeng Hu

Abstract When multimodal machine learning is applied to downstream inference, both joint and coordinated multimodal representations rely on all modalities being present, as they were in training. However, modal-incomplete data, in which certain modalities are missing, greatly reduces performance in Multimodal Sentiment Analysis (MSA) because of varying input forms and deficient semantic information. This limits the applicability of the predominant MSA methods in the real world, where the completeness of multimodal data is uncertain and variable. Generation-based methods attempt to generate the missing modality, yet they require complex hierarchical architectures with huge computational costs and struggle with the representation gaps across different modalities. In contrast, we propose a novel representation learning approach named MissModal, devoted to increasing robustness to missing modality in a classification approach. Specifically, we adopt constraints with geometric contrastive loss, distribution distance loss, and sentiment semantic loss to align the representations of modal-missing and modal-complete data, without impacting the sentiment inference for the complete modalities. Furthermore, we do not demand any changes in the multimodal fusion stage, highlighting the generality of our method in other multimodal learning systems. Extensive experiments on two public MSA datasets demonstrate that the proposed method achieves superior performance with minimal computational costs in various missing-modality scenarios (flexibility), including severely missing modality (efficiency).

Reference: Ronghao Lin, Haifeng Hu. MissModal: Increasing Robustness to Missing Modality in Multimodal Sentiment Analysis. Transactions of the Association for Computational Linguistics, pp. 1686-1702, 2023-12-01. https://doi.org/10.1162/tacl_a_00628

Cited by: 0

General then Personal: Decoupling and Pre-training for Personalized Headline Generation

Pub Date : 2023-12-01 DOI: 10.1162/tacl_a_00621

Yun-Zhu Song, Yi-Syuan Chen, Lu Wang, Hong-Han Shuai

Abstract Personalized Headline Generation aims to generate unique headlines tailored to users’ browsing history. In this task, understanding user preferences from click history and incorporating them into headline generation pose challenges. Existing approaches typically rely on predefined styles as control codes, but personal style lacks explicit definition or enumeration, making it difficult to leverage traditional techniques. To tackle these challenges, we propose General Then Personal (GTP), a novel framework comprising user modeling, headline generation, and customization. We train the framework using tailored designs that emphasize two central ideas: (a) task decoupling and (b) model pre-training. With the decoupling mechanism separating the task into generation and customization, two further mechanisms, information self-boosting and mask user modeling, are introduced to facilitate training and text control. Additionally, we introduce a new evaluation metric to address existing limitations. Extensive experiments conducted on the PENS dataset, considering both zero-shot and few-shot scenarios, demonstrate that GTP outperforms state-of-the-art methods. Furthermore, ablation studies and analysis emphasize the significance of decoupling and pre-training. Finally, human evaluation validates the effectiveness of our approaches.

Reference: Yun-Zhu Song, Yi-Syuan Chen, Lu Wang, Hong-Han Shuai. General then Personal: Decoupling and Pre-training for Personalized Headline Generation. Transactions of the Association for Computational Linguistics, pp. 1588-1607, 2023-12-01. https://doi.org/10.1162/tacl_a_00621

Cited by: 0

Removing Backdoors in Pre-trained Models by Regularized Continual Pre-training

Pub Date : 2023-12-01 DOI: 10.1162/tacl_a_00622

Biru Zhu, Ganqu Cui, Yangyi Chen, Yujia Qin, Lifan Yuan, Chong Fu, Yangdong Deng, Zhiyuan Liu, Maosong Sun, Ming Gu

Abstract Recent research has revealed that pre-trained models (PTMs) are vulnerable to backdoor attacks before the fine-tuning stage. Attackers can implant transferable task-agnostic backdoors in PTMs and control model outputs on any downstream task, which poses severe security threats to all downstream applications. Existing backdoor-removal defenses focus on task-specific classification models and are not suitable for defending PTMs against task-agnostic backdoor attacks. To this end, we propose the first task-agnostic backdoor removal method for PTMs. Based on the selective activation phenomenon in backdoored PTMs, we design a simple and effective backdoor eraser, which continually pre-trains the backdoored PTMs with a regularization term in an end-to-end approach. The regularization term removes backdoor functionalities from PTMs while the continual pre-training maintains the normal functionalities of PTMs. We conduct extensive experiments on pre-trained models across different modalities and architectures. The experimental results show that our method can effectively remove backdoors inside PTMs and preserve benign functionalities of PTMs with a small amount of downstream-task-irrelevant auxiliary data, e.g., unlabeled plain texts. The average attack success rate on three downstream datasets is reduced from 99.88% to 8.10% after our defense on the backdoored BERT. The code is publicly available at https://github.com/thunlp/RECIPE.
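
The headline number in this abstract is an attack success rate (ASR) averaged over three datasets. As a rough sketch of how such a figure is commonly computed, the snippet below counts how often trigger-carrying inputs are classified as the attacker's target label. Excluding inputs whose true label already equals the target is a common convention assumed here, not something the abstract states, and the function names are hypothetical.

```python
def attack_success_rate(predictions, true_labels, target_label):
    """Share of triggered inputs (true label != target) predicted as the target.

    predictions  -- model outputs on inputs that carry the backdoor trigger
    true_labels  -- the clean labels of those same inputs
    target_label -- the label the attacker wants the model to emit
    """
    eligible = [p for p, t in zip(predictions, true_labels) if t != target_label]
    if not eligible:
        raise ValueError("no inputs with a non-target true label")
    hits = sum(1 for p in eligible if p == target_label)
    return hits / len(eligible)


def average_rate(rates):
    """Plain mean of per-dataset rates."""
    return sum(rates) / len(rates)


# Toy data: three of the four inputs are eligible, two of them are hijacked.
asr = attack_success_rate([1, 1, 0, 1], [0, 0, 0, 1], target_label=1)
print(round(asr, 4))
```

A successful defense in this sense drives the rate on triggered inputs toward the level expected from ordinary misclassification.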

Reference: Biru Zhu, Ganqu Cui, Yangyi Chen, Yujia Qin, Lifan Yuan, Chong Fu, Yangdong Deng, Zhiyuan Liu, Maosong Sun, Ming Gu. Removing Backdoors in Pre-trained Models by Regularized Continual Pre-training. Transactions of the Association for Computational Linguistics, pp. 1608-1623, 2023-12-01. https://doi.org/10.1162/tacl_a_00622

Cited by: 1

An Efficient Self-Supervised Cross-View Training For Sentence Embedding

Pub Date : 2023-11-06 DOI: 10.1162/tacl_a_00620

Peerat Limkonchotiwat, Wuttikorn Ponwitayarat, Lalita Lowphansirikul, Can Udomcharoenchaikit, E. Chuangsuwanich, Sarana Nutanong

Abstract Self-supervised sentence representation learning is the task of constructing an embedding space for sentences without relying on human annotation efforts. One straightforward approach is to finetune a pretrained language model (PLM) with a representation learning method such as contrastive learning. While this approach achieves impressive performance on larger PLMs, the performance rapidly degrades as the number of parameters decreases. In this paper, we propose a framework called Self-supervised Cross-View Training (SCT) to narrow the performance gap between large and small PLMs. To evaluate the effectiveness of SCT, we compare it to 5 baseline and state-of-the-art competitors on seven Semantic Textual Similarity (STS) benchmarks using 5 PLMs with the number of parameters ranging from 4M to 340M. The experimental results show that SCT outperforms the competitors for PLMs with fewer than 100M parameters in 18 of 21 cases.

Cited by: 0

U-CORE: A Unified Deep Cluster-wise Contrastive Framework for Open Relation Extraction

Pub Date : 2023-11-01 DOI: 10.1162/tacl_a_00604

Jie Zhou, Shenpo Dong, Yunxin Huang, Meihan Wu, Haili Li, Jingnan Wang, Hongkui Tu, Xiaodong Wang

Abstract Within Open Relation Extraction (ORE) tasks, Zero-shot ORE methods generalize from predefined relations to undefined relations, while Unsupervised ORE methods extract undefined relations without the need for annotations. However, despite the possibility of overlap between predefined and undefined relations in the training data, a unified framework for both Zero-shot and Unsupervised ORE has yet to be established. To address this gap, we propose U-CORE: A Unified Deep Cluster-wise Contrastive Framework for both Zero-shot and Unsupervised ORE, by leveraging techniques from Contrastive Learning (CL) and Clustering. U-CORE overcomes the limitations of CL-based Zero-shot ORE methods by employing Cluster-wise CL that preserves both local smoothness and global semantics. Additionally, we employ a deep-cluster-based updater that optimizes the cluster center, thus enhancing the accuracy and efficiency of the model. To increase the stability of the model, we adopt Adaptive Self-paced Learning that effectively addresses the data-shifting problems. Experimental results on three well-known datasets demonstrate that U-CORE significantly improves upon existing methods, with an average improvement of 7.35% ARI on Zero-shot ORE tasks and 15.24% ARI on Unsupervised ORE tasks.
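
The improvements above are reported in ARI. Assuming ARI here denotes the Adjusted Rand Index, the usual chance-corrected measure of agreement between two clusterings, the following sketch shows the standard pair-counting formula. It is a generic illustration rather than the authors' evaluation code, and the function name is hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two flat clusterings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    n = len(labels_true)
    # Pairs of items that share a cluster in both clusterings.
    pairs_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs that share a cluster in each clustering taken separately.
    pairs_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # Fewer than two items: treat as perfect agreement.
    expected = pairs_true * pairs_pred / total_pairs
    maximum = 0.5 * (pairs_true + pairs_pred)
    if maximum == expected:
        return 1.0  # Degenerate case, e.g. both clusterings are a single cluster.
    return (pairs_both - expected) / (maximum - expected)


# Cluster ids are arbitrary, so a pure relabeling scores 1.0.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
# Splitting one true cluster lowers the score.
print(adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2]))
```

Because the index is computed over pairs of items, it does not require predicted cluster ids to match the gold relation labels, which suits settings where the discovered relations have no predefined names.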

Reference: Jie Zhou, Shenpo Dong, Yunxin Huang, Meihan Wu, Haili Li, Jingnan Wang, Hongkui Tu, Xiaodong Wang. U-CORE: A Unified Deep Cluster-wise Contrastive Framework for Open Relation Extraction. Transactions of the Association for Computational Linguistics, pp. 1301-1315, 2023-11-01. https://doi.org/10.1162/tacl_a_00604

Cited by: 0

AfriSpeech-200: Pan-African Accented Speech Dataset for Clinical and General Domain ASR

Pub Date : 2023-09-30 DOI: 10.1162/tacl_a_00627

Tobi Olatunji, Tejumade Afonja, Aditya Yadavalli, C. Emezue, Sahib Singh, Bonaventure F. P. Dossou, Joanne Osuchukwu, Salomey Osei, A. Tonja, Naome A. Etori, Clinton Mbataku

Abstract Africa has a very poor doctor-to-patient ratio. At very busy clinics, doctors could see 30+ patients per day (a heavy patient burden compared with developed countries), but productivity tools such as clinical automatic speech recognition (ASR) are lacking for these overworked clinicians. However, clinical ASR is mature, even ubiquitous, in developed nations, and clinician-reported performance of commercial clinical ASR systems is generally satisfactory. Furthermore, the recent performance of general domain ASR is approaching human accuracy. However, several gaps exist. Several publications have highlighted racial bias with speech-to-text algorithms, and performance on minority accents lags significantly. To our knowledge, there is no publicly available research or benchmark on accented African clinical ASR, and speech data is non-existent for the majority of African accents. We release AfriSpeech, 200 hours of Pan-African English speech comprising 67,577 clips from 2,463 unique speakers across 120 indigenous accents from 13 countries for clinical and general domain ASR, together with a benchmark test set and publicly available pre-trained models with SOTA performance on the AfriSpeech benchmark.

Reference: Tobi Olatunji, Tejumade Afonja, Aditya Yadavalli, C. Emezue, Sahib Singh, Bonaventure F. P. Dossou, Joanne Osuchukwu, Salomey Osei, A. Tonja, Naome A. Etori, Clinton Mbataku. AfriSpeech-200: Pan-African Accented Speech Dataset for Clinical and General Domain ASR. Transactions of the Association for Computational Linguistics, pp. 1669-1685, 2023-09-30. https://doi.org/10.1162/tacl_a_00627

Cited by: 0

MIRACL: A Multilingual Retrieval Dataset Covering 18 Diverse Languages

Pub Date : 2023-09-01 DOI: 10.1162/tacl_a_00595

Xinyu Crystina Zhang, Nandan Thakur, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Mehdi Rezagholizadeh, Jimmy Lin

Abstract MIRACL is a multilingual dataset for ad hoc retrieval across 18 languages that collectively encompass over three billion native speakers around the world. This resource is designed to support monolingual retrieval tasks, where the queries and the corpora are in the same language. In total, we have gathered over 726k high-quality relevance judgments for 78k queries over Wikipedia in these languages, where all annotations have been performed by native speakers hired by our team. MIRACL covers languages that are both typologically close as well as distant from 10 language families and 13 sub-families, associated with varying amounts of publicly available resources. Extensive automatic heuristic verification and manual assessments were performed during the annotation process to control data quality. In total, MIRACL represents an investment of around five person-years of human annotator effort. Our goal is to spur research on improving retrieval across a continuum of languages, thus enhancing information access capabilities for diverse populations around the world, particularly those that have traditionally been underserved. MIRACL is available at http://miracl.ai/.

Citations: 1

Shared Lexical Items as Triggers of Code Switching

IF 10.9 Zone 1 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Transactions of the Association for Computational Linguistics

Pub Date : 2023-08-29 DOI: 10.1162/tacl_a_00613

S. Wintner, Safaa Shehadi, Yuli Zeira, Doreen Osmelak, Yuval Nov

Abstract: Why do bilingual speakers code-switch (mix their two languages)? Among the several theories that attempt to explain this natural and ubiquitous phenomenon, the triggering hypothesis relates code-switching to the presence of lexical triggers, specifically cognates and proper names, adjacent to the switch point. We provide a fuller, more nuanced and refined exploration of the triggering hypothesis, based on five large datasets in three language pairs, reflecting both spoken and written bilingual interactions. Our results show that words that are assumed to reside in a mental lexicon shared by both languages indeed trigger code-switching, that the tendency to switch depends on the distance of the trigger from the switch point and on whether the trigger precedes or succeeds the switch, but not on the etymology of the trigger words. We thus provide strong, robust, evidence-based confirmation to several hypotheses on the relationships between lexical triggers and code-switching.

Citations: 0

Can Authorship Representation Learning Capture Stylistic Features?

IF 10.9 Zone 1 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Transactions of the Association for Computational Linguistics

Pub Date : 2023-08-22 DOI: 10.1162/tacl_a_00610

Andrew Wang, Cristina Aggazzotti, R. Kotula, Rafael A. Rivera Soto, M. Bishop, Nicholas Andrews

Abstract: Automatically disentangling an author’s style from the content of their writing is a longstanding and possibly insurmountable problem in computational linguistics. At the same time, the availability of large text corpora furnished with author labels has recently enabled learning authorship representations in a purely data-driven manner for authorship attribution, a task that ostensibly depends to a greater extent on encoding writing style than encoding content. However, success on this surrogate task does not ensure that such representations capture writing style since authorship could also be correlated with other latent variables, such as topic. In an effort to better understand the nature of the information these representations convey, and specifically to validate the hypothesis that they chiefly encode writing style, we systematically probe these representations through a series of targeted experiments. The results of these experiments suggest that representations learned for the surrogate authorship prediction task are indeed sensitive to writing style. As a consequence, authorship representations may be expected to be robust to certain kinds of data shift, such as topic drift over time. Additionally, our findings may open the door to downstream applications that require stylistic representations, such as style transfer.

Citations: 0