An in-depth analysis of passage-level label transfer for contextual document ranking

IF 1.9 · Zone 3 (Computer Science) · Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS · Information Retrieval Journal · Pub Date: 2023-12-08 · DOI: 10.1007/s10791-023-09430-5

Koustav Rudra, Zeon Trevor Fernando, Avishek Anand

Citations: 5

Abstract

Pre-trained contextual language models such as BERT, GPT, and XLNet work quite well for document retrieval tasks. Such models are fine-tuned on query-document or query-passage relevance labels to capture ranking signals. However, documents are longer than passages, and document ranking models built on BERT are constrained by its 512-token input limit. Researchers have proposed ranking strategies that either truncate documents at the token limit or chunk them into units that fit into BERT. In the latter case, the relevance labels are either transferred directly from the original query-document pair or learned through an external model. In this paper, we conduct a detailed study of how design decisions about splitting and label transfer affect retrieval effectiveness and efficiency. We find that direct transfer of relevance labels from documents to passages introduces label noise that strongly affects retrieval effectiveness for large training datasets. We also find that query processing times are adversely affected by fine-grained splitting schemes. As a remedy, we propose a careful passage-level labelling scheme using weak supervision that delivers improved performance (3–14% in terms of nDCG score) over most of the recently proposed models for ad-hoc retrieval while maintaining manageable computational complexity on four diverse document retrieval datasets.
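
The contrast between direct label transfer and a weakly supervised passage labelling can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: whitespace tokens stand in for BERT word pieces, and the hypothetical `overlap_score` term-overlap function stands in for whatever external model supplies the weak labels. The function names `split_into_passages`, `direct_transfer`, and `weak_transfer` are likewise invented here.

```python
from typing import Callable, List, Tuple


def split_into_passages(doc: str, size: int = 100, stride: int = 100) -> List[str]:
    """Chunk a document into windows of `size` whitespace tokens.

    Assumption: whitespace tokens approximate the model's tokens; a real
    pipeline would count word pieces against the 512-token limit.
    """
    tokens = doc.split()
    passages = []
    for start in range(0, len(tokens), stride):
        passages.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return passages


def direct_transfer(passages: List[str], doc_label: int) -> List[Tuple[str, int]]:
    """Copy the document-level label to every passage (the noisy baseline)."""
    return [(p, doc_label) for p in passages]


def overlap_score(query: str, passage: str) -> float:
    """Hypothetical weak labeller: fraction of query terms found in the passage."""
    q_terms = set(query.lower().split())
    p_terms = set(passage.lower().split())
    return len(q_terms & p_terms) / len(q_terms) if q_terms else 0.0


def weak_transfer(
    query: str,
    passages: List[str],
    doc_label: int,
    scorer: Callable[[str, str], float] = overlap_score,
    top_k: int = 1,
) -> List[Tuple[str, int]]:
    """Mark only the top_k highest-scoring passages of a relevant document as relevant."""
    if doc_label == 0 or not passages:
        return [(p, 0) for p in passages]
    ranked = sorted(range(len(passages)),
                    key=lambda i: scorer(query, passages[i]),
                    reverse=True)
    keep = set(ranked[:top_k])
    return [(p, 1 if i in keep else 0) for i, p in enumerate(passages)]


if __name__ == "__main__":
    query = "bert ranking"
    doc = ("the weather is nice today we went hiking "
           "bert ranking models work well here")
    passages = split_into_passages(doc, size=4, stride=4)
    print(direct_transfer(passages, 1))
    print(weak_transfer(query, passages, 1))
```

With direct transfer, the two off-topic passages about the weather inherit the positive label from the document, which is the kind of label noise the abstract describes. The weak-supervision variant keeps the positive label only on the passage the scorer ranks highest. Smaller `size` values produce more passages per document, which is one way to see why fine-grained splitting raises query processing cost.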

Source journal

Information Retrieval Journal (Engineering & Technology, Computer Science: Information Systems)

CiteScore

6.20

Self-citation rate

0.00%

Publication volume

Review time

13.5 months

About the journal: The journal provides an international forum for the publication of theory, algorithms, analysis and experiments across the broad area of information retrieval. Topics of interest include search, indexing, analysis, and evaluation for applications such as the web, social and streaming media, recommender systems, and text archives. This includes research on human factors in search, bridging artificial intelligence and information retrieval, and domain-specific search applications.