European Conference on Information Retrieval: Latest Papers

Quantifying Valence and Arousal in Text with Multilingual Pre-trained Transformers

European Conference on Information Retrieval

Pub Date : 2023-02-27 DOI: 10.48550/arXiv.2302.14021

Gonçalo Azevedo Mendes, Bruno Martins

The analysis of emotions expressed in text has numerous applications. In contrast to categorical analysis, focused on classifying emotions according to a pre-defined set of common classes, dimensional approaches can offer a more nuanced way to distinguish between different emotions. Still, dimensional methods have been less studied in the literature. Considering a valence-arousal dimensional space, this work assesses the use of pre-trained Transformers to predict these two dimensions on a continuous scale, with input texts from multiple languages and domains. We specifically combined multiple annotated datasets from previous studies, corresponding to either emotional lexica or short text documents, and evaluated models of multiple sizes and trained under different settings. Our results show that model size can have a significant impact on the quality of predictions, and that by fine-tuning a large model we can confidently predict valence and arousal in multiple languages. We make available the code, models, and supporting data.

Citations: 2

Automated Extraction of Fine-Grained Standardized Product Information from Unstructured Multilingual Web Data

European Conference on Information Retrieval

Pub Date : 2023-02-23 DOI: 10.48550/arXiv.2302.12139

Alexander Flick, Sebastian Jäger, Ivana Trajanovska, F. Biessmann

Extracting structured information from unstructured data is one of the key challenges in modern information retrieval applications, including e-commerce. Here, we demonstrate how recent advances in machine learning, combined with a recently published multilingual data set with standardized fine-grained product category information, enable robust product attribute extraction in challenging transfer learning settings. Our models can reliably predict product attributes across online shops, languages, or both. Furthermore, we show that our models can be used to match product taxonomies between online retailers.

Citations: 0

Query Performance Prediction for Neural IR: Are We There Yet?

European Conference on Information Retrieval

Pub Date : 2023-02-20 DOI: 10.48550/arXiv.2302.09947

G. Faggioli, Thibault Formal, S. Marchesin, S. Clinchant, N. Ferro, Benjamin Piwowarski

Evaluation in Information Retrieval relies on post-hoc empirical procedures, which are time-consuming and expensive operations. To alleviate this, Query Performance Prediction (QPP) models have been developed to estimate the performance of a system without the need for human-made relevance judgements. Such models, usually relying on lexical features from queries and corpora, have been applied to traditional sparse IR methods - with various degrees of success. With the advent of neural IR and large Pre-trained Language Models, the retrieval paradigm has significantly shifted towards more semantic signals. In this work, we study and analyze to what extent current QPP models can predict the performance of such systems. Our experiments consider seven traditional bag-of-words and seven BERT-based IR approaches, as well as nineteen state-of-the-art QPPs evaluated on two collections, Deep Learning '19 and Robust '04. Our findings show that QPPs perform statistically significantly worse on neural IR systems. In settings where semantic signals are prominent (e.g., passage retrieval), their performance on neural models drops by as much as 10% compared to bag-of-words approaches. On top of that, in lexical-oriented scenarios, QPPs fail to predict performance for neural IR systems on those queries where they differ from traditional approaches the most.

Citations: 6
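The abstract above does not say how QPP quality is scored. A common choice in the QPP literature is a rank correlation between the predictor's per-query scores and the measured per-query effectiveness. The sketch below shows that idea with Kendall's tau-a in plain Python. It is an illustration under stated assumptions, not the authors' evaluation code: the function name `kendall_tau` and the sample values are hypothetical.

```python
from itertools import combinations


def kendall_tau(predicted, actual):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    if len(predicted) != len(actual) or len(predicted) < 2:
        raise ValueError("need two equally long lists with at least 2 items")
    concordant = 0
    discordant = 0
    for i, j in combinations(range(len(predicted)), 2):
        # A pair is concordant when both lists order queries i and j the same way.
        product = (predicted[i] - predicted[j]) * (actual[i] - actual[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
        # Ties (product == 0) count toward neither group in tau-a.
    n_pairs = len(predicted) * (len(predicted) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical per-query values, invented purely for illustration.
qpp_scores = [0.9, 0.4, 0.7, 0.2]   # predictor output per query
actual_ap = [0.8, 0.5, 0.3, 0.1]    # measured effectiveness per query
tau = kendall_tau(qpp_scores, actual_ap)
print(f"Kendall's tau-a: {tau:.3f}")
```

A value near 1 means the predictor ranks queries much as the measured effectiveness does; a value near 0 means the predictor carries little ranking signal for that system.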

Joint Span Segmentation and Rhetorical Role Labeling with Data Augmentation for Legal Documents

European Conference on Information Retrieval

Pub Date : 2023-02-13 DOI: 10.48550/arXiv.2302.06448

Santosh T.Y.S.S, Philipp Bock, Matthias Grabmair

Segmentation and Rhetorical Role Labeling of legal judgements play a crucial role in retrieval and adjacent tasks, including case summarization, semantic search, argument mining etc. Previous approaches have formulated this task either as independent classification or sequence labeling of sentences. In this work, we reformulate the task at span level as identifying spans of multiple consecutive sentences that share the same rhetorical role label to be assigned via classification. We employ semi-Markov Conditional Random Fields (CRF) to jointly learn span segmentation and span label assignment. We further explore three data augmentation strategies to mitigate the data scarcity in the specialized domain of law where individual documents tend to be very long and annotation cost is high. Our experiments demonstrate improvement of span-level prediction metrics with a semi-Markov CRF model over a CRF baseline. This benefit is contingent on the presence of multi sentence spans in the document.

Citations: 1

Exploiting Graph Structured Cross-Domain Representation for Multi-Domain Recommendation

European Conference on Information Retrieval

Pub Date : 2023-02-12 DOI: 10.48550/arXiv.2302.05990

Alejandro Ariza-Casabona, Bartlomiej Twardowski, T. Wijaya

Multi-domain recommender systems benefit from cross-domain representation learning and positive knowledge transfer. Both can be achieved by introducing a specific modeling of input data (i.e. disjoint history) or trying dedicated training regimes. At the same time, treating domains as separate input sources becomes a limitation as it does not capture the interplay that naturally exists between domains. In this work, we efficiently learn multi-domain representation of sequential users' interactions using graph neural networks. We use temporal intra- and inter-domain interactions as contextual information for our method called MAGRec (short for Multi-domAin Graph-based Recommender). To better capture all relations in a multi-domain setting, we learn two graph-based sequential representations simultaneously: domain-guided for recent user interest, and general for long-term interest. This approach helps to mitigate the negative knowledge transfer problem from multiple domains and improve overall representation. We perform experiments on publicly available datasets in different scenarios where MAGRec consistently outperforms state-of-the-art methods. Furthermore, we provide an ablation study and discuss further extensions of our method.

Citations: 2

DocILE 2023 Teaser: Document Information Localization and Extraction

European Conference on Information Retrieval

Pub Date : 2023-01-29 DOI: 10.48550/arXiv.2301.12394

Štěpán Šimsa, Milan Šulc, Matyáš Skalický, Yash J. Patel, Ahmed Hamdi

The lack of data for information extraction (IE) from semi-structured business documents is a real problem for the IE community. Publications relying on large-scale datasets use only proprietary, unpublished data due to the sensitive nature of such documents. Publicly available datasets are mostly small and domain-specific. The absence of a large-scale public dataset or benchmark hinders the reproducibility and cross-evaluation of published methods. The DocILE 2023 competition, hosted as a lab at the CLEF 2023 conference and as an ICDAR 2023 competition, will run the first major benchmark for the tasks of Key Information Localization and Extraction (KILE) and Line Item Recognition (LIR) from business documents. With thousands of annotated real documents from open sources, a hundred thousand generated synthetic documents, and nearly a million unlabeled documents, the DocILE lab comes with the largest publicly available dataset for KILE and LIR. We are looking forward to contributions from the Computer Vision, Natural Language Processing, Information Retrieval, and other communities. The data, baselines, code and up-to-date information about the lab and competition are available at https://docile.rossum.ai/.

Citations: 2

Evolution of Filter Bubbles and Polarization in News Recommendation

European Conference on Information Retrieval

Pub Date : 2023-01-26 DOI: 10.48550/arXiv.2301.10926

Han Zhang, Ziwei Zhu, James Caverlee

Recent work in news recommendation has demonstrated that recommenders can over-expose users to articles that support their pre-existing opinions. However, most existing work focuses on a static setting or a short time window, leaving open questions about the long-term and dynamic impacts of news recommendations. In this paper, we explore these dynamic impacts through a systematic study of three research questions: 1) How do the news reading behaviors of users change after repeated long-term interactions with recommenders? 2) How do the inherent preferences of users change over time in such a dynamic recommender system? 3) Can the existing SOTA static method alleviate the problem in the dynamic environment? Concretely, we conduct a comprehensive data-driven study through simulation experiments of political polarization in news recommendations based on 40,000 annotated news articles. We find that users are rapidly exposed to more extreme content as the recommender evolves. We also find that a calibration-based intervention can slow down this polarization, but leaves open significant opportunities for future improvements.

Citations: 1

From Baseline to Top Performer: A Reproducibility Study of Approaches at the TREC 2021 Conversational Assistance Track

European Conference on Information Retrieval

Pub Date : 2023-01-25 DOI: 10.48550/arXiv.2301.10493

Weronika Lajewska, K. Balog

This paper reports on an effort of reproducing the organizers' baseline as well as the top performing participant submission at the 2021 edition of the TREC Conversational Assistance track. TREC systems are commonly regarded as reference points for effectiveness comparison. Yet, the papers accompanying them have less strict requirements than peer-reviewed publications, which can make reproducibility challenging. Our results indicate that key practical information is indeed missing. While the results can be reproduced within a 19% relative margin with respect to the main evaluation measure, the relative difference between the baseline and the top performing approach shrinks from the reported 18% to 5%. Additionally, we report on a new set of experiments aimed at understanding the impact of various pipeline components. We show that end-to-end system performance can indeed benefit from advanced retrieval techniques in either stage of a two-stage retrieval pipeline. We also measure the impact of the dataset used for fine-tuning the query rewriter and find that employing different query rewriting methods in different stages of the retrieval pipeline might be beneficial. Moreover, these results are shown to generalize across the 2020 and 2021 editions of the track. We conclude our study with a list of lessons learned and practical suggestions.

Citations: 2

A Study on FGSM Adversarial Training for Neural Retrieval

European Conference on Information Retrieval

Pub Date : 2023-01-25 DOI: 10.48550/arXiv.2301.10576

Simon Lupart, S. Clinchant

Neural retrieval models have achieved significant effectiveness gains over the last few years compared to term-based methods. Nevertheless, those models may be brittle when faced with typos or distribution shifts, or vulnerable to malicious attacks. For instance, several recent papers demonstrated that such variations severely impact model performance, and then tried to train more resilient models. Usual approaches include synonym replacement or typo injection (as data augmentation) and the use of more robust tokenizers (characterBERT, BPE-dropout). To further complement the literature, we investigate in this paper adversarial training as another possible solution to this robustness issue. Our comparison includes the two main families of BERT-based neural retrievers, i.e. dense and sparse, with and without distillation techniques. We then demonstrate that one of the simplest adversarial training techniques, the Fast Gradient Sign Method (FGSM), can improve the robustness and effectiveness of first-stage rankers. In particular, FGSM improves model performance on both in-domain and out-of-domain distributions, and also on queries with typos, for multiple neural retrievers.

Citations: 1
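For readers unfamiliar with the Fast Gradient Sign Method named in the abstract above, the general technique perturbs an input by a small step `epsilon` in the direction of the sign of the loss gradient with respect to that input: `x_adv = x + epsilon * sign(grad_x loss)`. The sketch below applies it to a toy logistic-regression scorer in plain Python. This is an assumption made for illustration only: the paper studies BERT-based retrievers, and the model, weights, and function names here are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    """Probability of the positive class under a toy logistic model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def logistic_loss(w, b, x, y):
    """Binary cross-entropy for a single example with label y in {0, 1}."""
    p = predict(w, b, x)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def input_gradient(w, b, x, y):
    """Gradient of the loss with respect to the input x: (p - y) * w."""
    p = predict(w, b, x)
    return [(p - y) * wi for wi in w]


def sign(v):
    return (v > 0) - (v < 0)


def fgsm_perturb(w, b, x, y, epsilon):
    """FGSM step: move each input feature by epsilon along the gradient sign."""
    grad = input_gradient(w, b, x, y)
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]


# Hypothetical toy parameters and example, invented for illustration.
w, b = [0.8, -1.2, 0.5], 0.1
x, y = [1.0, 0.5, -0.3], 1

x_adv = fgsm_perturb(w, b, x, y, epsilon=0.1)
clean_loss = logistic_loss(w, b, x, y)
adv_loss = logistic_loss(w, b, x_adv, y)
print(f"clean loss {clean_loss:.3f} -> adversarial loss {adv_loss:.3f}")
```

In adversarial training, perturbed examples like `x_adv` are fed back into the training objective alongside the clean ones, so the model learns to keep its loss low inside the `epsilon` neighborhood of each input.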

An Experimental Study on Pretraining Transformers from Scratch for IR

European Conference on Information Retrieval

Pub Date : 2023-01-25 DOI: 10.48550/arXiv.2301.10444

Carlos Lassance, Hervé Déjean, S. Clinchant

Finetuning Pretrained Language Models (PLMs) for IR has been the de facto standard practice since their breakthrough in effectiveness a few years ago. But is this approach well understood? In this paper, we study the impact of the pretraining collection on the final IR effectiveness. In particular, we challenge the current hypothesis that PLMs should be trained on a large enough generic collection, and we show that pretraining from scratch on the collection of interest is surprisingly competitive with the current approach. We benchmark first-stage rankers and cross-encoders for reranking on the task of general passage retrieval on MSMARCO, Mr-Tydi for Arabic, Japanese and Russian, and TripClick for a specific domain. Contrary to popular belief, we show that, for finetuning first-stage rankers, models pretrained solely on their collection have equivalent or better effectiveness compared to more general models. However, there is a slight effectiveness drop for rerankers pretrained only on the target collection. Overall, our study sheds new light on the role of the pretraining collection and should make our community ponder building specialized models by pretraining from scratch. Last but not least, doing so could enable better control of efficiency, data bias and replicability, which are key research questions for the IR community.

Citations: 6
