
First Workshop on Insights from Negative Results in NLP: Latest Publications

What GPT Knows About Who is Who
Pub Date: 2022-05-16 DOI: 10.48550/arXiv.2205.07407
Xiaohan Yang, Eduardo Peynetti, Vasco Meerman, Christy Tanner
Coreference resolution – which is a crucial task for understanding discourse and language at large – has yet to witness widespread benefits from large language models (LLMs). Moreover, coreference resolution systems largely rely on supervised labels, which are highly expensive and difficult to annotate, thus making it ripe for prompt engineering. In this paper, we introduce a QA-based prompt-engineering method and discern generative, pre-trained LLMs’ abilities and limitations toward the task of coreference resolution. Our experiments show that GPT-2 and GPT-Neo can return valid answers, but that their capabilities to identify coreferent mentions are limited and prompt-sensitive, leading to inconsistent results.
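As an illustration of the QA-based prompting the abstract describes, here is a minimal sketch using the Hugging Face transformers GPT-2 checkpoint; the prompt template and decoding settings are illustrative assumptions, not the authors' exact setup.

```python
# A minimal sketch of QA-style prompting for coreference with GPT-2.
# The prompt wording and decoding settings are assumptions for illustration.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

passage = "Alice met Bob before she left for the airport."
prompt = f'Passage: {passage}\nQuestion: In the passage, who does "she" refer to?\nAnswer:'

out = generator(prompt, max_new_tokens=5, do_sample=False)
answer = out[0]["generated_text"][len(prompt):].strip()
print(answer)  # A prompt-sensitive model may or may not answer "Alice".
```

Small variations in the prompt wording often flip the answer, which is the prompt sensitivity the abstract reports.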
Citations: 6
Pathologies of Pre-trained Language Models in Few-shot Fine-tuning
Pub Date: 2022-04-17 DOI: 10.48550/arXiv.2204.08039
Hanjie Chen, Guoqing Zheng, A. Awadallah, Yangfeng Ji
Although adapting pre-trained language models with few examples has shown promising performance on text classification, there is a lack of understanding of where the performance gain comes from. In this work, we propose to answer this question by interpreting the adaptation behavior using post-hoc explanations from model predictions. By modeling feature statistics of explanations, we discover that (1) without fine-tuning, pre-trained models (e.g. BERT and RoBERTa) show strong prediction bias across labels; (2) although few-shot fine-tuning can mitigate the prediction bias and demonstrate promising prediction performance, our analysis shows models gain performance improvement by capturing non-task-related features (e.g. stop words) or shallow data patterns (e.g. lexical overlaps). These observations warn that pursuing model performance with fewer examples may incur pathological prediction behavior, which calls for further sanity checks on model predictions and careful design of model evaluations in few-shot fine-tuning.
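The prediction-bias finding in (1) can be checked with a simple histogram of predicted labels from a pre-trained encoder whose classification head has not been fine-tuned; the model choice and toy inputs below are assumptions for illustration, not the paper's setup.

```python
# Sketch: measure label prediction bias of a pre-trained encoder with an
# un-fine-tuned (randomly initialized) classification head.
from collections import Counter
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["great movie", "terrible plot", "it was fine", "not worth it"]
with torch.no_grad():
    logits = model(**tok(texts, padding=True, return_tensors="pt")).logits
preds = logits.argmax(dim=-1).tolist()

print(Counter(preds))  # A heavily skewed histogram indicates bias across labels.
```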
Citations: 1
Can Question Rewriting Help Conversational Question Answering?
Pub Date: 2022-04-13 DOI: 10.48550/arXiv.2204.06239
Etsuko Ishii, Yan Xu, Samuel Cahyawijaya, Bryan Wilie
Question rewriting (QR) is a subtask of conversational question answering (CQA) aiming to ease the challenges of understanding dependencies among dialogue history by reformulating questions in a self-contained form. Despite seeming plausible, little evidence is available to justify QR as a mitigation method for CQA. To verify the effectiveness of QR in CQA, we investigate a reinforcement learning approach that integrates QR and CQA tasks and does not require corresponding QR datasets for targeted CQA. We find, however, that the RL method is on par with the end-to-end baseline. We provide an analysis of the failure and describe the difficulty of exploiting QR for CQA.
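The abstract does not spell out the reward; one plausible reading is that the rewriter is rewarded by the downstream CQA model's answer quality. A sketch of such a reward, using the token-level F1 common in extractive QA (treating answer F1 as the reward is an assumption):

```python
# Token-level F1, usable as an RL reward for a question-rewriting policy:
# reward = token_f1(cqa_model(rewritten_question, context), gold_answer).
# Using answer F1 as the reward signal is an assumption for illustration.
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    pred_toks, gold_toks = prediction.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_toks), overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(token_f1("in the park", "the park"))  # 0.8
```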
Citations: 5
Extending the Scope of Out-of-Domain: Examining QA models in multiple subdomains
Pub Date: 2022-04-09 DOI: 10.48550/arXiv.2204.04534
Chenyang Lyu, Jennifer Foster, Yvette Graham
Past work that investigates out-of-domain performance of QA systems has mainly focused on general domains (e.g. news domain, wikipedia domain), underestimating the importance of subdomains defined by the internal characteristics of QA datasets. In this paper, we extend the scope of “out-of-domain” by splitting QA examples into different subdomains according to their internal characteristics, including question type, text length, and answer position. We then examine the performance of QA systems trained on the data from different subdomains. Experimental results show that the performance of QA systems can be significantly reduced when the train data and test data come from different subdomains. These results question the generalizability of current QA systems in multiple subdomains, suggesting the need to combat the bias introduced by the internal characteristics of QA datasets.
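A subdomain split of the kind described can be as simple as bucketing each example by question word, context length, and answer offset; the thresholds and field names below are illustrative assumptions, not the paper's exact definitions.

```python
# Sketch: bucket SQuAD-style examples into subdomains by question type,
# context length, and answer position. Thresholds are illustrative.
def subdomain(example: dict) -> tuple:
    question, context = example["question"], example["context"]
    wh = question.lower().split()[0]
    q_type = wh if wh in {"who", "what", "when", "where", "why", "how", "which"} else "other"
    length = "short" if len(context.split()) < 120 else "long"
    position = "early" if example["answer_start"] < len(context) / 2 else "late"
    return (q_type, length, position)

ex = {"question": "Who wrote it?", "context": "The novel was written by Ada.", "answer_start": 25}
print(subdomain(ex))  # ('who', 'short', 'late')
```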
Citations: 3
Do Data-based Curricula Work?
Pub Date: 2021-12-13 DOI: 10.18653/v1/2022.insights-1.16
Maxim K. Surkov, Vladislav D. Mosin, Ivan P. Yamshchikov
Current state-of-the-art NLP systems use large neural networks that require extensive computational resources for training. Inspired by human knowledge acquisition, researchers have proposed curriculum learning - sequencing tasks (task-based curricula) or ordering and sampling the datasets (data-based curricula) that facilitate training. This work investigates the benefits of data-based curriculum learning for large language models such as BERT and T5. We experiment with various curricula based on complexity measures and different sampling strategies. Extensive experiments on several NLP tasks show that curricula based on various complexity measures rarely have any benefits, while random sampling performs either as well or better than curricula.
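A data-based curriculum of the kind tested reduces to choosing a presentation order: sort by a complexity proxy versus shuffle at random. A minimal sketch, with sentence length standing in for the paper's complexity measures (an assumed, illustrative proxy):

```python
# Sketch: complexity-ordered curriculum vs. the random-sampling baseline
# that the paper finds performs as well or better.
import random

def curriculum_order(examples):
    # One illustrative complexity proxy: token count, easy to hard.
    return sorted(examples, key=lambda s: len(s.split()))

def random_order(examples, seed=0):
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return shuffled

data = ["a b", "a b c d e", "a", "a b c"]
print(curriculum_order(data))  # ['a', 'a b', 'a b c', 'a b c d e']
print(random_order(data))
```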
Citations: 3
Embedding Structured Dictionary Entries
Pub Date: 2020-11-01 DOI: 10.18653/v1/2020.insights-1.18
Steven R. Wilson, Walid Magdy, Barbara McGillivray, Gareth Tyson
Previous work has shown how to effectively use external resources such as dictionaries to improve English-language word embeddings, either by manipulating the training process or by applying post-hoc adjustments to the embedding space. We experiment with a multi-task learning approach for explicitly incorporating the structured elements of dictionary entries, such as user-assigned tags and usage examples, when learning embeddings for dictionary headwords. Our work generalizes several existing models for learning word embeddings from dictionaries. However, we find that the most effective representations overall are learned by simply training with a skip-gram objective over the concatenated text of all entries in the dictionary, giving no particular focus to the structure of the entries.
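The strongest baseline the abstract reports, a skip-gram objective over the concatenated text of each entry, might look like the following gensim sketch; the toy dictionary and hyperparameters are assumptions.

```python
# Sketch: plain skip-gram (sg=1) over concatenated dictionary entries,
# ignoring the structure of tags, definitions, and usage examples.
from gensim.models import Word2Vec

entries = {
    "bank": ["the land alongside a river", "a financial institution", "noun"],
    "run": ["move at a speed faster than a walk", "verb"],
}
# Concatenate headword and all entry parts into one token stream per entry.
sentences = [[head] + " ".join(parts).split() for head, parts in entries.items()]

model = Word2Vec(sentences, sg=1, vector_size=50, window=5, min_count=1, epochs=50)
print(model.wv.most_similar("bank", topn=3))
```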
Citations: 2
How Far Can We Go with Data Selection? A Case Study on Semantic Sequence Tagging Tasks
Pub Date: 2020-11-01 DOI: 10.18653/v1/2020.insights-1.3
Samuel Louvan, B. Magnini
Although several works have addressed the role of data selection to improve transfer learning for various NLP tasks, there is no consensus about its real benefits and, more generally, there is a lack of shared practices on how it can be best applied. We propose a systematic approach aimed at evaluating data selection in scenarios of increasing complexity. Specifically, we compare the case in which source and target tasks are the same while source and target domains are different, against the more challenging scenario where both tasks and domains are different. We run a number of experiments on semantic sequence tagging tasks, which are relatively less investigated in data selection, and conclude that data selection is more beneficial when the tasks are the same, while for different (although related) tasks from distant domains, a combination of data selection and multi-task learning is ineffective in most cases.
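One common data-selection heuristic of the kind evaluated picks the source examples most similar to a target-domain sample; the Jaccard-overlap criterion below is an illustrative choice, not necessarily the paper's.

```python
# Sketch: select the k source-domain examples closest to a target sample
# by vocabulary (Jaccard) overlap.
def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def select(source, target_sample, k):
    scored = [(max(jaccard(s, t) for t in target_sample), s) for s in source]
    return [s for _, s in sorted(scored, reverse=True)[:k]]

source = ["book a flight to boston", "play some jazz music", "find flights to paris"]
target = ["book me a flight from london"]
print(select(source, target, k=2))
```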
Citations: 0
NMF Ensembles? Not for Text Summarization!
Pub Date: 2020-11-01 DOI: 10.18653/v1/2020.insights-1.14
Alka Khurana, Vasudha Bhatnagar
Non-negative Matrix Factorization (NMF) has been used for text analytics with promising results. Instability of results arising due to stochastic variations during initialization makes a case for the use of ensemble techniques. However, our extensive empirical investigation indicates otherwise. In this paper, we establish that an ensemble summary for a single document using NMF is no better than the best base model summary.
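For context, a common recipe for single-document summarization with NMF factorizes a sentence-term matrix and ranks sentences by topic loading; the sketch below follows that recipe and is not necessarily the authors' exact model.

```python
# Sketch: NMF-based single-document summarization. Rank sentences by their
# strongest topic loading in W, where tfidf_matrix ~ W @ H.
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "NMF factorizes a matrix into non-negative factors.",
    "Initialization noise can change the factors it finds.",
    "Ensembles average several runs to stabilize results.",
    "The weather was pleasant on the day of the experiment.",
]
X = TfidfVectorizer().fit_transform(sentences)
W = NMF(n_components=2, init="nndsvd").fit_transform(X)

top = np.argsort(W.max(axis=1))[::-1][:2]  # two highest-loading sentences
print([sentences[i] for i in sorted(top)])
```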
Citations: 1
Which Matters Most? Comparing the Impact of Concept and Document Relationships in Topic Models
Pub Date: 2020-11-01 DOI: 10.18653/v1/2020.insights-1.5
Silvia Terragni, Debora Nozza, E. Fersini, M. Enza
Topic models have been widely used to discover hidden topics in a collection of documents. In this paper, we propose to investigate the role of two different types of relational information, i.e. document relationships and concept relationships. While exploiting the document network significantly improves topic coherence, the introduction of concepts and their relationships does not influence the results either quantitatively or qualitatively.
Citations: 7
If You Build Your Own NER Scorer, Non-replicable Results Will Come
Pub Date: 2020-11-01 DOI: 10.18653/v1/2020.insights-1.15
Constantine Lignos, Marjan Kamyab
We attempt to replicate a named entity recognition (NER) model implemented in a popular toolkit and discover that a critical barrier to doing so is the inconsistent evaluation of improper label sequences. We define these sequences and examine how two scorers differ in their handling of them, finding that one approach produces F1 scores approximately 0.5 points higher on the CoNLL 2003 English development and test sets. We propose best practices to increase the replicability of NER evaluations by increasing transparency regarding the handling of improper label sequences.
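The improper sequences in question are I- tags with no preceding B- under IOB2 (e.g. O I-PER I-PER); whether a scorer silently repairs them or drops them changes the entity count, and hence F1. A minimal sketch of the two conventions (the function is illustrative, not either toolkit's actual code):

```python
# Sketch: two readings of the improper IOB2 sequence O I-PER I-PER.
# conlleval-style scorers repair the orphan I- into a new entity;
# strict scorers yield no entity at all, shifting F1.
def spans(tags, repair_improper):
    entities, start = [], None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing entity
        if start is not None and not tag.startswith("I-"):
            entities.append((start, i))  # close the open entity
            start = None
        if tag.startswith("B-") or (repair_improper and tag.startswith("I-") and start is None):
            start = i
    return entities

tags = ["O", "I-PER", "I-PER"]
print(spans(tags, repair_improper=True))   # [(1, 3)]
print(spans(tags, repair_improper=False))  # []
```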
Citations: 7