
First Workshop on Insights from Negative Results in NLP: Latest Publications

Counterfactually-Augmented SNLI Training Data Does Not Yield Better Generalization Than Unaugmented Data
Pub Date: 2020-10-09 | DOI: 10.18653/v1/2020.insights-1.13
William Huang, Haokun Liu, Samuel R. Bowman
A growing body of work shows that models exploit annotation artifacts to achieve state-of-the-art performance on standard crowdsourced benchmarks (datasets collected from crowdworkers to create an evaluation task) while still failing on out-of-domain examples for the same task. Recent work has explored the use of counterfactually-augmented data (data built by minimally editing a set of seed examples to yield counterfactual labels) to augment the training data associated with these benchmarks and build more robust classifiers that generalize better. However, Khashabi et al. (2020) find that this type of augmentation yields little benefit on reading comprehension tasks when controlling for dataset size and cost of collection. We build upon this work by using English natural language inference data to test model generalization and robustness, and find that models trained on a counterfactually-augmented SNLI dataset do not generalize better than models trained on unaugmented datasets of similar size, and that counterfactual augmentation can hurt performance, yielding models that are less robust to challenge examples. Counterfactual augmentation of natural language understanding data through standard crowdsourcing techniques does not appear to be an effective way of collecting training data, and further innovation is required to make this general line of work viable.
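To make the setup concrete, here is a minimal Python sketch of what counterfactual augmentation of NLI data looks like; the premise/hypothesis strings and the helper structure are hypothetical, not drawn from SNLI or the authors' release.

```python
# A minimal sketch of counterfactual augmentation for NLI, not the authors'
# code: each seed example is paired with a minimally edited variant whose
# label flips. All strings here are hypothetical.

seed = [
    # (premise, hypothesis, label)
    ("A man plays a guitar on stage.", "A man is performing music.", "entailment"),
]

counterfactual = [
    # One small edit to the hypothesis flips the label.
    ("A man plays a guitar on stage.", "A man is performing a comedy act.", "contradiction"),
]

# The paper's size-controlled comparison: a model trained on seed +
# counterfactual did not generalize better than one trained on an
# unaugmented set of the same total size.
augmented_train = seed + counterfactual
for premise, hypothesis, label in augmented_train:
    print(f"{label:>13} | {premise} -> {hypothesis}")
```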
Citations: 30
HINT3: Raising the bar for Intent Detection in the Wild
Pub Date: 2020-09-29 | DOI: 10.18653/v1/2020.insights-1.16
Gaurav Arora, Chirag Jain, Manas Chaturvedi, Krupal Modi
Intent detection systems in the real world are exposed to the complexities of imbalanced datasets containing varying perceptions of intent, unintended correlations, and domain-specific aberrations. To facilitate benchmarking that reflects near-real-world scenarios, we introduce 3 new datasets created from live chatbots in diverse domains. Unlike most existing datasets, which are crowdsourced, our datasets contain real user queries received by the chatbots and facilitate penalising unwanted correlations picked up during the training process. We evaluate 4 NLU platforms and a BERT-based classifier and find that performance saturates at inadequate levels on the test sets because all systems latch on to unintended patterns in the training data.
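As an illustration of how a classifier can latch on to unintended patterns, the following sketch trains a bag-of-words intent classifier on templated utterances and then inspects its learned cues; the toy utterances and labels are hypothetical, not from the HINT3 datasets.

```python
# A sketch, not the HINT3 pipeline: a bag-of-words classifier trained on
# templated utterances, with its learned cues inspected afterwards. The toy
# utterances and labels are hypothetical.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train_texts = [
    "please book a table for two", "book a table tonight please",  # book_table
    "what time do you open", "what are your opening hours",        # opening_hours
]
train_labels = ["book_table"] * 2 + ["opening_hours"] * 2

# Real user queries rarely repeat the training templates verbatim.
test_texts = ["can i reserve a spot for 2 ppl?", "u guys open rn?"]
test_labels = ["book_table", "opening_hours"]

vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(train_texts), train_labels)
print(clf.score(vec.transform(test_texts), test_labels))

# Inspecting weights reveals what the model keyed on (e.g., the politeness
# marker "please" can become a "book_table" cue: an unintended correlation).
feats = np.array(vec.get_feature_names_out())
w = clf.coef_[0]  # binary case: positive weights push toward clf.classes_[1]
print("book_table cues:", feats[np.argsort(w)[:3]])
print("opening_hours cues:", feats[np.argsort(w)[-3:]])
```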
Citations: 16
On the Impact of Data Augmentation on Downstream Performance in Natural Language Processing
DOI: 10.18653/v1/2022.insights-1.12
Itsuki Okimura, Machel Reid, M. Kawano, Yutaka Matsuo
Within the broader scope of machine learning, data augmentation is a common strategy for improving the generalization and robustness of machine learning models. While data augmentation has been widely used in computer vision, its use in NLP has been comparatively limited. The reason is that, within NLP, the impact of proposed data augmentation methods on performance has not been evaluated in a unified manner, and effective data augmentation methods remain unclear. In this paper, we tackle this by evaluating the impact of 12 data augmentation methods on multiple datasets when finetuning pre-trained language models. We find minimal improvements when data sizes are constrained to a few thousand examples, with performance degradation when data size is increased. We also use various methods to quantify the strength of data augmentations, and find that these values, though weakly correlated with downstream performance, correlate negatively or positively depending on the task. Furthermore, we find a glaring lack of consistently performant data augmentations. All of this points to the difficulty of data augmentation for NLP tasks, and we are inclined to believe that static data augmentations are not broadly applicable given these properties.
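For a concrete (if simplified) picture, the sketch below implements one EDA-style augmentation, random deletion, together with a crude strength measure based on token overlap; the deletion probability, seed, and example sentence are illustrative assumptions, not the paper's setup.

```python
# A minimal sketch: one EDA-style augmentation (random deletion) plus a crude
# augmentation-strength measure (1 - Jaccard token overlap). Parameters and
# the example sentence are illustrative, not the paper's configuration.
import random

def random_deletion(tokens, p=0.25, seed=1):
    """Drop each token independently with probability p."""
    rng = random.Random(seed)
    kept = [t for t in tokens if rng.random() > p]
    return kept or [rng.choice(tokens)]  # never return an empty sentence

def strength(original, augmented):
    """0 = unchanged, 1 = no shared tokens."""
    a, b = set(original), set(augmented)
    return 1 - len(a & b) / len(a | b)

orig = "the movie was surprisingly good".split()
aug = random_deletion(orig)
print(aug, strength(orig, aug))
```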
Citations: 6
Label Errors in BANKING77
DOI: 10.18653/v1/2022.insights-1.19
Cecilia Ying, Stephen Thomas
We investigate potential label errors in the popular BANKING77 dataset and their negative impacts on intent classification methods. Motivated by our own negative results when constructing an intent classifier, we applied two automated approaches to identify potential label errors in the dataset. We found that over 1,400 (14%) of the 10,003 training utterances may have been incorrectly labelled. In a simple experiment, we found that by removing the utterances with potential errors, our intent classifier saw an increase of 4.5% and 8% in F1-Score and Adjusted Rand Index, respectively, in supervised and unsupervised classification. This paper serves as a warning about the potential for noisy labels in popular NLP datasets. Further study is needed to fully identify the breadth and depth of label errors in BANKING77 and other datasets.
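The abstract does not name its two detectors, but the following sketch shows one automated check in the spirit of confident learning: flag an utterance when an out-of-sample model gives its annotated label less confidence than that label's per-class average. The toy probabilities and the threshold rule are assumptions for illustration, not the authors' exact method.

```python
# A sketch in the spirit of confident learning, not necessarily the authors'
# exact method: flag an example when an out-of-sample model assigns its
# annotated label less confidence than that label's average, while some other
# class clears its own average. Probabilities below are toy values.
import numpy as np

def flag_label_issues(pred_probs, labels):
    """pred_probs: (n, k) out-of-sample probabilities; labels: (n,) ints."""
    n, k = pred_probs.shape
    # Per-class threshold: mean confidence among examples annotated with that class.
    thresholds = np.array([pred_probs[labels == c, c].mean() for c in range(k)])
    conf_in_label = pred_probs[np.arange(n), labels]
    above = pred_probs >= thresholds        # (n, k): classes clearing their threshold
    above[np.arange(n), labels] = False     # ignore the annotated label itself
    return (conf_in_label < thresholds[labels]) & above.any(axis=1)

probs = np.array([[0.90, 0.10],
                  [0.20, 0.80],
                  [0.15, 0.85]])
labels = np.array([0, 1, 0])               # third utterance looks mislabelled
print(flag_label_issues(probs, labels))    # [False False  True]
```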
Citations: 4
Replicability under Near-Perfect Conditions – A Case-Study from Automatic Summarization
DOI: 10.18653/v1/2022.insights-1.23
Margot Mieskes
Replication of research results has become more and more important in Natural Language Processing. Nevertheless, we still rely on results reported in the literature for comparison. Additionally, elements of an experimental setup are not always completely reported; this includes, but is not limited to, the specific parameters used or an omitted implementational detail. In our experiment, based on two frequently used data sets from the domain of automatic summarization and the seemingly full disclosure of research artifacts, we examine how well reported results can be replicated and which elements influence the success or failure of replication. Our results indicate that publishing research artifacts is far from sufficient, and that publishing all relevant parameters in full detail is crucial.
Citations: 0
BPE beyond Word Boundary: How NOT to use Multi Word Expressions in Neural Machine Translation
DOI: 10.18653/v1/2022.insights-1.24
Dipesh Kumar, Avijit Thawani
BPE tokenization merges characters into longer tokens by finding frequently occurring contiguous patterns within the word boundary. An intuitive relaxation would be to extend a BPE vocabulary with multi-word expressions (MWEs): bigrams (in_a), trigrams (out_of_the), and skip-grams (he . his). In the context of Neural Machine Translation (NMT), we replace the least frequent subword/whole-word tokens with the most frequent MWEs. We find that these modifications to BPE end up hurting the model, resulting in a net drop in BLEU and chrF scores across two language pairs. We observe that naively extending BPE beyond word boundaries results in incoherent tokens which are themselves better represented as individual words. Moreover, we find that Pointwise Mutual Information (PMI), rather than frequency, finds better MWEs (e.g., New_York, Statue_of_Liberty, neither . nor), which consistently improves translation performance. We release all code at https://github.com/pegasus-lynx/mwe-bpe.
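A small sketch of the PMI ranking the abstract refers to; the toy corpus and the top-k cut-off are assumptions, and the actual implementation lives at the URL above.

```python
# A toy PMI ranking for MWE candidates (corpus and cut-off are assumptions;
# the real implementation is at https://github.com/pegasus-lynx/mwe-bpe).
import math
from collections import Counter

corpus = ("new york is big . i love new york . "
          "the statue of liberty is in new york").split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())

def pmi(w1, w2):
    """PMI(w1, w2) = log [ p(w1, w2) / (p(w1) * p(w2)) ]."""
    return math.log((bigrams[(w1, w2)] / n_bi)
                    / ((unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)))

# PMI surfaces collocations such as statue_of over merely frequent pairs.
ranked = sorted(bigrams, key=lambda b: pmi(*b), reverse=True)
print([("_".join(b), round(pmi(*b), 2)) for b in ranked[:3]])
```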
Citations: 1
How Much Do Modifications to Transformer Language Models Affect Their Ability to Learn Linguistic Knowledge?
DOI: 10.18653/v1/2022.insights-1.6
Simeng Sun, Brian Dillon, Mohit Iyyer
Recent progress in large pretrained language models (LMs) has led to a growth of analyses examining what kinds of linguistic knowledge are encoded by these models. Due to computational constraints, existing analyses are mostly conducted on publicly-released LM checkpoints, which makes it difficult to study how various factors during training affect the models’ acquisition of linguistic knowledge. In this paper, we train a suite of small-scale Transformer LMs that differ from each other with respect to architectural decisions (e.g., self-attention configuration) or training objectives (e.g., multi-tasking, focal loss). We evaluate these LMs on BLiMP, a targeted evaluation benchmark of multiple English linguistic phenomena. Our experiments show that while none of these modifications yields significant improvements on aggregate, changes to the loss function result in promising improvements on several subcategories (e.g., detecting adjunct islands, correctly scoping negative polarity items). We hope our work offers useful insights for future research into designing Transformer LMs that more effectively learn linguistic knowledge.
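As one concrete example of a training-objective modification, here is a minimal PyTorch sketch of focal loss next to standard cross-entropy; the gamma value and toy logits are illustrative assumptions, not the paper's configuration.

```python
# A minimal PyTorch sketch of focal loss, FL = -(1 - p_t)^gamma * log(p_t),
# which down-weights tokens the model already predicts confidently. The gamma
# value and toy logits are illustrative, not the paper's configuration.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """logits: (batch, vocab); targets: (batch,) gold token ids."""
    log_probs = F.log_softmax(logits, dim=-1)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)  # log p_t
    pt = log_pt.exp()
    return (-((1 - pt) ** gamma) * log_pt).mean()

logits = torch.tensor([[2.0, 0.5, -1.0], [0.1, 0.2, 0.0]])
targets = torch.tensor([0, 2])
print(focal_loss(logits, targets))        # focal variant
print(F.cross_entropy(logits, targets))   # standard LM objective, for comparison
```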
Citations: 0
Clustering Examples in Multi-Dataset Benchmarks with Item Response Theory
DOI: 10.18653/v1/2022.insights-1.14
Pedro Rodriguez, Phu Mon Htut, John P. Lalor, João Sedoc
In natural language processing, multi-dataset benchmarks for common tasks (e.g., SuperGLUE for natural language inference and MRQA for question answering) have risen in importance. Invariably, tasks and individual examples vary in difficulty. Recent analysis methods infer properties of examples such as difficulty. In particular, Item Response Theory (IRT) jointly infers example and model properties from the output of benchmark tasks (i.e., scores for each model-example pair). Therefore, it seems sensible that methods like IRT should be able to detect differences between datasets in a task. This work shows that current IRT models are not as good at identifying differences as we would expect, explains why this is difficult, and outlines future directions that incorporate more (textual) signal from examples.
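For readers unfamiliar with IRT, the sketch below fits the simplest one-parameter (Rasch) variant, P(correct) = sigmoid(ability - difficulty), to a tiny synthetic response matrix by gradient descent; the data, optimizer, and step count are assumptions, not the paper's setup.

```python
# A sketch of the simplest IRT variant (1PL / Rasch):
#   P(model i solves example j) = sigmoid(ability_i - difficulty_j),
# fit by gradient descent on a tiny synthetic response matrix. Data, optimizer
# and step count are assumptions; parameters are identifiable only up to a
# shared offset.
import torch

responses = torch.tensor([  # responses[i, j] = 1.0 if model i got example j right
    [1., 1., 1., 0.],
    [1., 1., 0., 0.],
    [1., 0., 0., 0.],
])
n_models, n_examples = responses.shape

ability = torch.zeros(n_models, requires_grad=True)
difficulty = torch.zeros(n_examples, requires_grad=True)
opt = torch.optim.Adam([ability, difficulty], lr=0.1)

for _ in range(500):
    opt.zero_grad()
    logits = ability[:, None] - difficulty[None, :]
    loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, responses)
    loss.backward()
    opt.step()

# Harder examples get larger difficulty; one could then test whether these
# difficulties cluster by source dataset.
print(difficulty.detach())
```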
Citations: 0
An Empirical study to understand the Compositional Prowess of Neural Dialog Models
DOI: 10.18653/v1/2022.insights-1.21
Vinayshekhar Bannihatti Kumar, Vaibhav Kumar, Mukul Bhutani, Alexander I. Rudnicky
In this work, we examine the problems associated with neural dialog models under the common theme of compositionality. Specifically, we investigate three manifestations of compositionality: (1) Productivity, (2) Substitutivity, and (3) Systematicity. These manifestations shed light on the generalization, syntactic robustness, and semantic capabilities of neural dialog models. We design probing experiments by perturbing the training data to study the above phenomena. We make informative observations based on automated metrics and hope that this work increases research interest in understanding the capacity of these models.
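As a flavour of such perturbation-based probes, the sketch below generates substitutivity probes by swapping in synonyms while holding the target response fixed; the dialog pair and the synonym table are hypothetical, not the paper's data.

```python
# A sketch of a substitutivity probe: swap a word for a synonym and keep the
# target response fixed; a compositional dialog model should behave the same
# on both variants. The dialog pair and synonym table are hypothetical.
SYNONYMS = {"movie": "film", "great": "fantastic"}

def substitutivity_probes(context, response):
    """Yield (perturbed_context, response), one per substitutable word."""
    tokens = context.split()
    for i, tok in enumerate(tokens):
        if tok in SYNONYMS:
            perturbed = tokens[:i] + [SYNONYMS[tok]] + tokens[i + 1:]
            yield " ".join(perturbed), response

pair = ("that movie was great", "glad you enjoyed it !")
for ctx, resp in substitutivity_probes(*pair):
    print(ctx, "->", resp)  # score the dialog model on original vs. perturbed
```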
Citations: 1
Do Dependency Relations Help in the Task of Stance Detection?
DOI: 10.18653/v1/2022.insights-1.2
A. T. Cignarella, C. Bosco, Paolo Rosso
In this paper we present a set of multilingual experiments tackling the task of Stance Detection in five different languages: English, Spanish, Catalan, French and Italian. Furthermore, we study the phenomenon of stance with respect to six different targets (one per language, and two for Italian), employing a variety of machine learning algorithms that primarily exploit morphological and syntactic knowledge as features, represented in the Universal Dependencies format. The results suggest that the methodology employed is not beneficial per se, but that the same features might be useful with a different methodology.
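A minimal sketch of the general recipe, dependency-relation counts as features for a linear stance classifier, follows; it assumes spaCy with the en_core_web_sm model installed, and the two labelled examples are invented, not from the paper's datasets.

```python
# A sketch of the general recipe, not the paper's pipeline: count dependency
# relation labels per text and feed them to a linear classifier. Assumes
# `python -m spacy download en_core_web_sm`; the two labelled examples are
# invented.
from collections import Counter

import spacy
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

nlp = spacy.load("en_core_web_sm")

def dep_features(text):
    """Bag of dependency-relation labels, e.g. {'nsubj': 1, 'dobj': 1, ...}."""
    return dict(Counter(tok.dep_ for tok in nlp(text)))

texts = ["I fully support this policy", "this policy will never work"]
labels = ["FAVOR", "AGAINST"]

vec = DictVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(dep_features(t) for t in texts), labels)
print(clf.predict(vec.transform([dep_features("I back this policy")])))
```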
Citations: 1