
Proceedings of the Conference on Empirical Methods in Natural Language Processing: Latest Publications

Automatic Generation of Socratic Subquestions for Teaching Math Word Problems
K. Shridhar, Jakub Macina, Mennatallah El-Assady, Tanmay Sinha, Manu Kapur, Mrinmaya Sachan
Socratic questioning is an educational method that allows students to discover answers to complex problems by asking them a series of thoughtful questions. Generating didactically sound questions is challenging and requires an understanding of the reasoning process involved in the problem. We hypothesize that such a questioning strategy can not only enhance human performance, but also assist math word problem (MWP) solvers. In this work, we explore the ability of large language models (LMs) to generate sequential questions for guiding math word problem solving. We propose various guided question generation schemes based on input conditioning and reinforcement learning. On both automatic and human quality evaluations, we find that LMs constrained with desirable question properties generate superior questions and improve the overall performance of a math word problem solver. We conduct a preliminary user study to examine the potential value of such question generation models in the education domain. Results suggest that the difficulty level of problems plays an important role in determining whether questioning improves or hinders human performance. We discuss the future of using such questioning strategies in education.
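As a rough illustration of input conditioning, the sketch below prepends a desired question property to the problem before prompting an off-the-shelf seq2seq LM; the model choice, control-token format, and prompt are assumptions made for illustration, not the authors' setup, and the RL-based schemes are not shown.

```python
# A minimal sketch of input-conditioned subquestion generation, assuming an
# off-the-shelf seq2seq LM; the "property:" control token and prompt format
# are hypothetical, not the paper's actual conditioning scheme.
from transformers import pipeline

generator = pipeline("text2text-generation", model="google/flan-t5-base")

problem = (
    "Rachel has 17 apples. She gives 9 to Sarah. "
    "How many apples does Rachel have now?"
)
# Input conditioning: prepend the desired question property to steer
# generation toward an operation-specific subquestion.
conditioned_input = f"property: subtraction | problem: {problem} | next question:"

subquestion = generator(conditioned_input, max_new_tokens=32)[0]["generated_text"]
print(subquestion)
```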
DOI: 10.48550/arXiv.2211.12835 · Pages: 4136-4149 · Published: 2022-11-23
Citations: 13
Leveraging Data Recasting to Enhance Tabular Reasoning
Aashna Jena, Vivek Gupta, Manish Shrivastava, Julian Martin Eisenschlos
Creating challenging tabular inference data is essential for learning complex reasoning. Prior work has mostly relied on two data generation strategies. The first is human annotation, which yields linguistically diverse data but is difficult to scale. The second is synthetic generation, which is scalable and cost-effective but lacks inventiveness. In this research, we present a framework for semi-automatically recasting existing tabular data that combines the benefits of both approaches. We utilize our framework to build tabular NLI instances from five datasets that were initially intended for tasks like table2text creation, tabular Q/A, and semantic parsing. We demonstrate that recast data can be used both as evaluation benchmarks and as augmentation data to enhance performance on tabular NLI tasks. Furthermore, we investigate the effectiveness of models trained on recast data in the zero-shot scenario, and analyse trends in performance across different types of recast datasets.
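To make the recasting idea concrete, here is a minimal sketch that turns one tabular Q/A example into an entailed and a refuted NLI hypothesis; the field names and hypothesis template are illustrative assumptions, not the paper's actual recasting rules.

```python
# A minimal sketch of recasting a tabular Q/A example into tabular NLI
# instances; templates and fields are hypothetical.
qa_example = {
    "table": [["Player", "Goals"], ["Messi", "91"], ["Ronaldo", "69"]],
    "question": "How many goals did Messi score?",
    "answer": "91",
}

def recast_to_nli(example, wrong_answer):
    """Turn one Q/A pair into an entailed and a refuted NLI hypothesis."""
    template = "The answer to '{q}' is {a}."
    premise = example["table"]
    return [
        {"premise": premise,
         "hypothesis": template.format(q=example["question"], a=example["answer"]),
         "label": "entailment"},
        {"premise": premise,
         "hypothesis": template.format(q=example["question"], a=wrong_answer),
         "label": "contradiction"},
    ]

for inst in recast_to_nli(qa_example, wrong_answer="69"):
    print(inst["label"], "->", inst["hypothesis"])
```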
DOI: 10.48550/arXiv.2211.12641 · Pages: 4483-4496 · Published: 2022-11-23
Citations: 2
Mask the Correct Tokens: An Embarrassingly Simple Approach for Error Correction
Kai Shen, Yichong Leng, Xuejiao Tan, Si-Qi Tang, Yuan Zhang, Wenjie Liu, Ed Lin
Text error correction aims to correct the errors in text sequences such as those typed by humans or generated by speech recognition models. Previous error correction methods usually take the source (incorrect) sentence as encoder input and generate the target (correct) sentence through the decoder. Since the error rate of the incorrect sentence is usually low (e.g., 10%), the correction model can only learn to correct on the limited error tokens while trivially copying most tokens (the correct tokens), which harms the effective training of error correction. In this paper, we argue that the correct tokens should be better utilized to facilitate effective training, and we propose a simple yet effective masking strategy to achieve this goal. Specifically, we randomly mask out a part of the correct tokens in the source sentence and let the model learn not only to correct the original error tokens but also to predict the masked tokens based on their context information. Our method enjoys several advantages: 1) it alleviates trivial copying; 2) it leverages effective training signals from correct tokens; 3) it is a plug-and-play module and can be applied to different models and tasks. Experiments on spelling error correction and speech recognition error correction on Mandarin datasets, and on grammar error correction on English datasets, with both autoregressive and non-autoregressive generation models, show that our method consistently improves correction accuracy.
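The core masking idea can be sketched in a few lines, assuming token-aligned source/target pairs; the 15% mask rate and the [MASK] symbol are illustrative assumptions.

```python
# A minimal sketch of masking correct source tokens so the model must
# predict them from context as well as fix the true errors.
import random

def mask_correct_tokens(src_tokens, tgt_tokens, mask_rate=0.15, mask_sym="[MASK]"):
    masked = []
    for s, t in zip(src_tokens, tgt_tokens):
        if s == t and random.random() < mask_rate:
            masked.append(mask_sym)  # correct token: force prediction from context
        else:
            masked.append(s)         # error tokens stay visible for correction
    return masked

src = "i hav a drem".split()    # noisy input
tgt = "i have a dream".split()  # reference
print(mask_correct_tokens(src, tgt))
```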
DOI: 10.48550/arXiv.2211.13252 · Pages: 10367-10380 · Published: 2022-11-23
Citations: 6
Enhancing Self-Consistency and Performance of Pre-Trained Language Models through Natural Language Inference
E. Mitchell, Joseph J. Noh, Siyan Li, William S. Armstrong, Ananth Agarwal, Patrick Liu, Chelsea Finn, Christopher D. Manning
While large pre-trained language models are powerful, their predictions often lack logical consistency across test inputs. For example, a state-of-the-art Macaw question-answering (QA) model answers Yes to "Is a sparrow a bird?" and "Does a bird have feet?" but answers No to "Does a sparrow have feet?". To address this failure mode, we propose a framework, Consistency Correction through Relation Detection, or ConCoRD, for boosting the consistency and accuracy of pre-trained NLP models using pre-trained natural language inference (NLI) models without fine-tuning or re-training. Given a batch of test inputs, ConCoRD samples several candidate outputs for each input and instantiates a factor graph that accounts for both the model's belief about the likelihood of each answer choice in isolation and the NLI model's beliefs about pair-wise answer choice compatibility. We show that a weighted MaxSAT solver can efficiently compute high-quality answer choices under this factor graph, improving over the raw model's predictions. Our experiments demonstrate that ConCoRD consistently boosts the accuracy and consistency of off-the-shelf closed-book QA and VQA models using off-the-shelf NLI models, notably increasing the accuracy of LXMERT on ConVQA by 5% absolute. See the project website (https://ericmitchell.ai/emnlp-2022-concord/) for code and data.
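A toy version of this inference can be brute-forced for intuition: pick the joint answer assignment that maximizes per-answer QA confidence plus an NLI-style compatibility bonus. The scores and the single constraint below are invented for illustration; a real system would use a weighted MaxSAT solver over the factor graph.

```python
# A minimal brute-force sketch of ConCoRD-style consistent decoding.
from itertools import product

candidates = {
    "Is a sparrow a bird?":      {"Yes": 0.9, "No": 0.1},
    "Does a bird have feet?":    {"Yes": 0.8, "No": 0.2},
    "Does a sparrow have feet?": {"Yes": 0.4, "No": 0.6},  # model's raw error
}

# Hypothetical NLI-derived constraint: if the first two answers are Yes,
# the third should be Yes too; reward assignments that satisfy it.
def compatibility(assign):
    if assign["Is a sparrow a bird?"] == "Yes" and \
       assign["Does a bird have feet?"] == "Yes":
        return 1.0 if assign["Does a sparrow have feet?"] == "Yes" else -1.0
    return 0.0

best = max(
    (dict(zip(candidates, choice))
     for choice in product(*[c.keys() for c in candidates.values()])),
    key=lambda a: sum(candidates[q][a[q]] for q in a) + compatibility(a),
)
print(best)  # the consistent assignment: all "Yes"
```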
DOI: 10.48550/arXiv.2211.11875 · Pages: 1754-1768 · Published: 2022-11-21
Citations: 16
UniMSE: Towards Unified Multimodal Sentiment Analysis and Emotion Recognition
Guimin Hu, Ting-En Lin, Yi Zhao, Guangming Lu, Yuchuan Wu, Yongbin Li
Multimodal sentiment analysis (MSA) and emotion recognition in conversation (ERC) are key research topics for enabling computers to understand human behaviors. From a psychological perspective, emotions are the expression of affect or feelings during a short period, while sentiments are formed and held over a longer period. However, most existing works study sentiment and emotion separately and do not fully exploit the complementary knowledge behind the two. In this paper, we propose UniMSE, a multimodal sentiment knowledge-sharing framework that unifies MSA and ERC tasks at the level of features, labels, and models. We perform modality fusion at the syntactic and semantic levels and introduce contrastive learning between modalities and samples to better capture the difference and consistency between sentiments and emotions. Experiments on four public benchmark datasets, MOSI, MOSEI, MELD, and IEMOCAP, demonstrate the effectiveness of the proposed method, which achieves consistent improvements over state-of-the-art methods.
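As a sketch of the inter-modality contrastive component, an InfoNCE-style loss over paired text/audio embeddings could look as follows; the embedding shapes and temperature are assumptions, and UniMSE's actual fusion architecture is not reproduced here.

```python
# A minimal sketch of inter-modality contrastive learning: matched
# text/audio pairs for the same utterance are pulled together, all other
# pairs in the batch act as negatives.
import torch
import torch.nn.functional as F

def contrastive_loss(text_emb, audio_emb, temperature=0.07):
    text_emb = F.normalize(text_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)
    logits = text_emb @ audio_emb.t() / temperature  # pairwise similarities
    labels = torch.arange(len(text_emb))             # i-th text matches i-th audio
    return F.cross_entropy(logits, labels)

text = torch.randn(8, 256)   # batch of utterance-level text features
audio = torch.randn(8, 256)  # aligned audio features for the same utterances
print(contrastive_loss(text, audio).item())
```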
DOI: 10.48550/arXiv.2211.11256 · Pages: 7837-7851 · Published: 2022-11-21
Citations: 24
Evaluating the Knowledge Dependency of Questions
Hyeongdon Moon, Yoonseok Yang, Jamin Shin, Hangyeol Yu, Seunghyun Lee, Myeongho Jeong, Juneyoung Park, Minsam Kim, Seungtaek Choi
The automatic generation of Multiple Choice Questions (MCQ) has the potential to significantly reduce the time educators spend on student assessment. However, existing evaluation metrics for MCQ generation, such as BLEU, ROUGE, and METEOR, focus on the n-gram-based similarity of the generated MCQ to the gold sample in the dataset and disregard its educational value. They fail to evaluate the MCQ's ability to assess the student's knowledge of the corresponding target fact. To tackle this issue, we propose a novel automatic evaluation metric, coined Knowledge Dependent Answerability (KDA), which measures the MCQ's answerability given knowledge of the target fact. Specifically, we first show how to measure KDA based on student responses from a human survey. Then, we propose two automatic evaluation metrics, KDA_disc and KDA_cont, that approximate KDA by leveraging pre-trained language models to imitate students' problem-solving behavior. Through our human studies, we show that KDA_disc and KDA_cont have strong correlations with both (1) KDA and (2) usability in an actual classroom setting, as labeled by experts. Furthermore, when combined with n-gram-based similarity metrics, KDA_disc and KDA_cont are shown to have strong predictive power for various expert-labeled MCQ quality measures.
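The intuition behind KDA can be written down directly: the fraction of knowledge-equipped solvers that pick the correct option. The toy solvers below stand in for surveyed students (or the LM probes used by KDA_disc/KDA_cont); everything in this sketch is illustrative, not the paper's measurement protocol.

```python
# A minimal sketch of Knowledge Dependent Answerability: an MCQ is
# answerable to the degree that solvers who know the target fact pick
# the correct option.
def kda(mcq, solvers_with_knowledge):
    """Fraction of knowledge-equipped solvers choosing the correct option."""
    correct = sum(1 for solve in solvers_with_knowledge
                  if solve(mcq["question"], mcq["options"]) == mcq["answer"])
    return correct / len(solvers_with_knowledge)

mcq = {
    "question": "Which planet is known as the Red Planet?",
    "options": ["Venus", "Mars", "Jupiter"],
    "answer": "Mars",
}
# Toy "solvers" standing in for surveyed students / LM probes.
solvers = [lambda q, opts: "Mars",
           lambda q, opts: "Mars",
           lambda q, opts: opts[0]]
print(kda(mcq, solvers))  # 2/3 of solvers answer correctly
```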
DOI: 10.48550/arXiv.2211.11902 · Pages: 10512-10526 · Published: 2022-11-21
Citations: 1
CGoDial: A Large-Scale Benchmark for Chinese Goal-oriented Dialog Evaluation
Yinpei Dai, Wanwei He, Bowen Li, Yuchuan Wu, Zhen Cao, Zhongqi An, Jian Sun, Yongbin Li
Practical dialog systems need to deal with various knowledge sources, noisy user expressions, and a shortage of annotated data. To better address these problems, we propose CGoDial, a new challenging and comprehensive Chinese benchmark for multi-domain goal-oriented dialog evaluation. It contains 96,763 dialog sessions and 574,949 dialog turns in total, covering three datasets with different knowledge sources: 1) a slot-based dialog (SBD) dataset with table-formed knowledge, 2) a flow-based dialog (FBD) dataset with tree-formed knowledge, and 3) a retrieval-based dialog (RBD) dataset with candidate-formed knowledge. To bridge the gap between academic benchmarks and spoken dialog scenarios, we either collect data from real conversations or add spoken features to existing datasets via crowd-sourcing. The proposed experimental settings cross training on either the entire training set or a few-shot training set with testing on either the standard test set or a hard test subset, which can assess model capabilities in terms of general prediction, fast adaptability, and reliable robustness.
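The evaluation matrix described above amounts to crossing two training regimes with two test conditions, as in this small sketch (the setting names are illustrative):

```python
# A minimal sketch of the benchmark's train/test evaluation grid.
from itertools import product

train_regimes = ["full_training_set", "few_shot_training_set"]
test_conditions = ["standard_test_set", "hard_test_subset"]

for train, test in product(train_regimes, test_conditions):
    print(f"train={train} -> evaluate on {test}")
```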
DOI: 10.48550/arXiv.2211.11617 · Pages: 4097-4111 · Published: 2022-11-21
Citations: 7
Speaker Overlap-aware Neural Diarization for Multi-party Meeting Analysis
Zhihao Du, Shiliang Zhang, Siqi Zheng, Zhijie Yan
Recently, hybrid systems of clustering and neural diarization models have been successfully applied to multi-party meeting analysis. However, current models typically treat overlapped speaker diarization as a multi-label classification problem, where speaker dependency and overlaps are not well considered. To overcome these disadvantages, we reformulate the overlapped speaker diarization task as a single-label prediction problem via the proposed power set encoding (PSE). Through this formulation, speaker dependency and overlaps can be explicitly modeled. To fully leverage this formulation, we further propose the speaker overlap-aware neural diarization (SOND) model, which consists of a context-independent (CI) scorer to model global speaker discriminability, a context-dependent (CD) scorer to model local discriminability, and a speaker combining network (SCN) to combine and reassign speaker activities. Experimental results show that the proposed formulation outperforms state-of-the-art methods based on target-speaker voice activity detection, and that performance can be further improved with SOND, resulting in a 6.30% relative reduction in diarization error.
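Power set encoding itself is easy to sketch: every subset of speakers becomes one class, so a frame with overlapping speakers gets a single label instead of multiple labels. The three-speaker setup below is an illustrative assumption.

```python
# A minimal sketch of power set encoding (PSE): each frame's set of active
# speakers maps to one class index, turning multi-label diarization into
# single-label classification.
from itertools import combinations

speakers = ["spk0", "spk1", "spk2"]
# Enumerate all subsets (the power set) and assign each one a class id.
power_set = [frozenset(c) for r in range(len(speakers) + 1)
             for c in combinations(speakers, r)]
encode = {s: i for i, s in enumerate(power_set)}

frame_activity = {"spk0", "spk2"}          # two speakers overlap here
label = encode[frozenset(frame_activity)]  # single label for the frame
print(label, power_set[label])             # decoding recovers the speaker set
```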
DOI: 10.48550/arXiv.2211.10243 · Pages: 7458-7469 · Published: 2022-11-18
Citations: 5
A Dataset for Hyper-Relational Extraction and a Cube-Filling Approach
Yew Ken Chia, Lidong Bing, Sharifah Mahani Aljunied, Luo Si, Soujanya Poria
Relation extraction has the potential to enable large-scale knowledge graph construction, but current methods do not consider the qualifier attributes of each relation triplet, such as time, quantity, or location. The qualifiers form hyper-relational facts which better capture the rich and complex knowledge graph structure. For example, the relation triplet (Leonard Parker, Educated At, Harvard University) can be factually enriched by including the qualifier (End Time, 1967). Hence, we propose the task of hyper-relational extraction to extract more specific and complete facts from text. To support the task, we construct HyperRED, a large-scale and general-purpose dataset. Existing models cannot perform hyper-relational extraction, as it requires a model to consider the interaction between three entities. Hence, we propose CubeRE, a cube-filling model that is inspired by table-filling approaches and explicitly considers the interaction between relation triplets and qualifiers. To improve model scalability and reduce negative class imbalance, we further propose a cube-pruning method. Our experiments show that CubeRE outperforms strong baselines and reveal possible directions for future research. Our code and data are available at github.com/declare-lab/HyperRED.
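A hyper-relational fact is essentially a triplet plus qualifier key-value pairs. Here is a minimal sketch of such a record using the paper's example; the dataclass layout is an assumption for illustration, not HyperRED's actual schema.

```python
# A minimal sketch of a hyper-relational fact: a relation triplet enriched
# with qualifier key-value pairs.
from dataclasses import dataclass, field

@dataclass
class HyperRelationalFact:
    head: str
    relation: str
    tail: str
    qualifiers: dict = field(default_factory=dict)

fact = HyperRelationalFact(
    head="Leonard Parker",
    relation="Educated At",
    tail="Harvard University",
    qualifiers={"End Time": "1967"},
)
print(fact)
```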
DOI: 10.48550/arXiv.2211.10018 · Pages: 10114-10133 · Published: 2022-11-18
Citations: 6
Consultation Checklists: Standardising the Human Evaluation of Medical Note Generation
Aleksandar Savkov, Francesco Moramarco, Alex Papadopoulos Korfiatis, Mark Perera, Anya Belz, E. Reiter
Evaluating automatically generated text is generally hard due to the inherently subjective nature of many aspects of output quality. This difficulty is compounded in automatic consultation note generation by differing opinions among medical experts, both about which patient statements should be included in generated notes and about their respective importance in arriving at a diagnosis. Previous real-world evaluations of note-generation systems saw substantial disagreement between expert evaluators. In this paper we propose a protocol that aims to increase objectivity by grounding evaluations in Consultation Checklists, which are created in a preliminary step and then used as a common point of reference during quality assessment. We observed good levels of inter-annotator agreement in a first evaluation study using the protocol; further, using the Consultation Checklists produced in the study as references for automatic metrics such as ROUGE or BERTScore improves their correlation with human judgements compared to using the original human note.
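As a rough sketch of checklist-grounded scoring, one can measure how many checklist items a generated note covers; the token-overlap heuristic and the 0.6 threshold below are illustrative stand-ins for the ROUGE/BERTScore metrics actually used.

```python
# A minimal sketch of scoring a generated note against a Consultation
# Checklist via simple token overlap.
import string

def tokens(text):
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return set(cleaned.split())

def covers(item, note, threshold=0.6):
    item_tokens = tokens(item)
    return len(item_tokens & tokens(note)) / len(item_tokens) >= threshold

checklist = [
    "patient reports sore throat for three days",
    "no fever or cough",
]
generated_note = "Sore throat for three days. Denies fever and cough."
score = sum(covers(item, generated_note) for item in checklist) / len(checklist)
print(f"checklist coverage: {score:.2f}")  # the negated item is missed: 0.50
```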
DOI: 10.48550/arXiv.2211.09455 · Pages: 111-120 · Published: 2022-11-17
Citations: 5