Unsupervised Domain Adaptation for Sparse Retrieval by Filling Vocabulary and Word Frequency Gaps

Q3 Environmental Science AACL Bioflux Pub Date : 2022-11-08 DOI:10.48550/arXiv.2211.03988

Hiroki Iida, Naoaki Okazaki


Citations: 3

Abstract

Information retrieval (IR) models built on pretrained language models significantly outperform lexical approaches such as BM25. In particular, SPLADE, which encodes texts into sparse vectors, is effective in practice because it is robust on out-of-domain datasets. However, SPLADE still struggles with exact matching of words that occur infrequently in its training data. In addition, domain shifts in vocabulary and word frequencies degrade its retrieval performance. Because supervision data are scarce in the target domain, these shifts must be addressed without supervision. This paper proposes an unsupervised domain adaptation method that fills the vocabulary and word-frequency gaps. First, we expand the vocabulary and continue pretraining with masked language modeling on a target-domain corpus. Then, we multiply SPLADE-encoded sparse vectors by inverse document frequency (IDF) weights to account for the importance of documents containing low-frequency words. We evaluate the method on datasets with a large vocabulary gap from the source domain and show that it outperforms the present state-of-the-art domain adaptation method. In addition, our method achieves state-of-the-art results when combined with BM25.
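
The two steps named in the abstract can be illustrated with a small, library-free sketch. It is not the authors' implementation. The function names, the frequency-based rule for choosing new vocabulary entries, the smoothed IDF formula (the variant `log((N + 1) / (df + 1)) + 1`), and the toy corpus are all assumptions made for illustration. Continued masked-language-model pretraining is not shown, and the sparse vectors are plain term-to-weight dictionaries standing in for SPLADE output.

```python
import math
from collections import Counter


def expand_vocabulary(vocab, target_docs, max_new_words):
    """Append the most frequent target-domain words that vocab lacks.

    Assumption: new entries are chosen by raw corpus frequency.
    """
    known = set(vocab)
    counts = Counter(w for doc in target_docs for w in doc)
    missing = [(w, c) for w, c in counts.items() if w not in known]
    # Descending frequency, then alphabetical order for determinism.
    missing.sort(key=lambda wc: (-wc[1], wc[0]))
    return list(vocab) + [w for w, _ in missing[:max_new_words]]


def inverse_document_frequency(target_docs):
    """Smoothed IDF per term (assumed variant): log((N + 1) / (df + 1)) + 1."""
    n_docs = len(target_docs)
    doc_freq = Counter(w for doc in target_docs for w in set(doc))
    return {w: math.log((n_docs + 1) / (df + 1)) + 1.0
            for w, df in doc_freq.items()}


def apply_idf(sparse_vec, idf, default=1.0):
    """Multiply each nonzero term weight by its IDF weight.

    Terms without an IDF entry keep their weight (default factor 1.0).
    """
    return {t: w * idf.get(t, default) for t, w in sparse_vec.items()}


# Toy tokenized target-domain corpus (hypothetical data).
target_docs = [
    ["aspirin", "inhibits", "cyclooxygenase"],
    ["cyclooxygenase", "inhibitors", "reduce", "inflammation"],
    ["aspirin", "reduces", "fever"],
]

expanded = expand_vocabulary(["aspirin", "fever", "reduce"], target_docs, 1)
idf = inverse_document_frequency(target_docs)

# A stand-in for a SPLADE-encoded sparse vector: term -> weight.
encoded = {"aspirin": 1.0, "fever": 1.0}
weighted = apply_idf(encoded, idf)
```

In this toy corpus, "fever" occurs in one document and "aspirin" in two, so after weighting the rarer term carries the larger weight even though both started equal. Whether the weights are applied to queries, documents, or both is not specified in the abstract, so the sketch leaves that choice to the caller.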

