Unsupervised Domain Adaptation for Sparse Retrieval by Filling Vocabulary and Word Frequency Gaps

Q3 Environmental Science AACL Bioflux | Pub Date: 2022-11-08 | DOI: 10.48550/arXiv.2211.03988
Hiroki Iida, Naoaki Okazaki
Citations: 3

Abstract

Information retrieval (IR) models built on a pretrained language model significantly outperform lexical approaches such as BM25. In particular, SPLADE, which encodes texts into sparse vectors, is effective in practice because it is robust on out-of-domain datasets. However, SPLADE still struggles with exact matching of words that are infrequent in the training data. In addition, domain shifts in vocabulary and word frequencies degrade the retrieval performance of SPLADE. Because supervision data are scarce in the target domain, these shifts must be addressed without supervision data. This paper proposes an unsupervised domain adaptation method that fills the vocabulary and word-frequency gaps. First, we expand the vocabulary and perform continual pretraining with a masked language model on a corpus of the target domain. Then, we multiply SPLADE-encoded sparse vectors by inverse document frequency weights to account for the importance of documents containing low-frequency words. We conducted experiments with our method on datasets that have a large vocabulary gap from the source domain. We show that our method outperforms the current state-of-the-art domain adaptation method. In addition, when combined with BM25, our method achieves state-of-the-art results.
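The second step of the abstract, multiplying sparse vectors by inverse document frequency (IDF) weights, can be illustrated with a minimal sketch. This is not the authors' implementation: the smoothed IDF formula, the function names (`idf_weights`, `reweight`, `dot`), the toy corpus, and the choice to reweight a document-side vector are all assumptions made for illustration. The abstract does not specify which IDF variant is used or whether queries, documents, or both are reweighted, and the dictionaries below merely stand in for SPLADE-encoded vectors.

```python
import math
from collections import Counter


def idf_weights(docs_tokens):
    """Smoothed IDF per term, computed from tokenized target-domain documents.

    Assumed formula (not taken from the paper): ln((N + 1) / (df + 1)) + 1.
    """
    n_docs = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))  # count each term at most once per document
    return {t: math.log((n_docs + 1) / (d + 1)) + 1.0 for t, d in df.items()}


def reweight(sparse_vec, idf, default=1.0):
    """Multiply each term weight of a sparse vector by that term's IDF."""
    return {t: w * idf.get(t, default) for t, w in sparse_vec.items()}


def dot(q, d):
    """Inner product of two sparse vectors stored as dicts."""
    if len(q) > len(d):
        q, d = d, q
    return sum(w * d.get(t, 0.0) for t, w in q.items())


# Hypothetical toy corpus standing in for the target domain.
corpus = [
    ["covid", "vaccine", "trial"],
    ["vaccine", "dose"],
    ["trial", "protocol", "dose"],
]
idf = idf_weights(corpus)

# Hypothetical sparse vectors standing in for SPLADE outputs.
doc_vec = {"covid": 1.0, "vaccine": 1.0}
query_vec = {"covid": 1.0}

weighted_doc = reweight(doc_vec, idf)
score = dot(query_vec, weighted_doc)
```

In this toy corpus "covid" appears in one document and "vaccine" in two, so after reweighting the rarer term carries the larger weight, which is the effect the abstract attributes to the IDF multiplication.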