Esmaeil Narimissa (Australian Taxation Office), David Raithel (Australian Taxation Office)
Exploring Information Retrieval Landscapes: An Investigation of a Novel Evaluation Techniques and Comparative Document Splitting Methods
arXiv - CS - Information Retrieval, published 2024-09-13, DOI: arxiv-2409.08479
Abstract
The performance of Retrieval-Augmented Generation (RAG) systems in
information retrieval is significantly influenced by the characteristics of the
documents being processed. In this study, the structured nature of textbooks,
the conciseness of articles, and the narrative complexity of novels are shown
to require distinct retrieval strategies. A comparative evaluation of multiple
document-splitting methods reveals that the Recursive Character Splitter
outperforms the Token-based Splitter in preserving contextual integrity. A
novel evaluation technique is introduced, utilizing an open-source model to
generate a comprehensive dataset of question-and-answer pairs, simulating
realistic retrieval scenarios to enhance testing efficiency and metric
reliability. The evaluation employs weighted scoring metrics, including
SequenceMatcher, BLEU, METEOR, and BERTScore, to assess the system's accuracy
and relevance. This approach establishes a refined standard for evaluating the
precision of RAG systems, with future research focusing on optimizing chunk and
overlap sizes to improve retrieval accuracy and efficiency.
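
The weighted scoring described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the weights or how the metrics are combined, so the weighted average, the weight values, and all function names below are assumptions. Only SequenceMatcher (from Python's standard `difflib` module) is used directly; a simple unigram-overlap F1 stands in for BLEU, METEOR, and BERTScore, which need external libraries.

```python
from collections import Counter
from difflib import SequenceMatcher


def sequence_similarity(reference: str, candidate: str) -> float:
    """Character-level similarity in [0, 1] from difflib.SequenceMatcher."""
    return SequenceMatcher(None, reference, candidate).ratio()


def token_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1: an assumed stand-in for BLEU/METEOR/BERTScore."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if not ref_tokens or not cand_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def weighted_score(reference, candidate, metrics, weights):
    """Weighted average of the metric scores, normalised by the weight sum."""
    total_weight = sum(weights[name] for name in metrics)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(
        weights[name] * metric(reference, candidate)
        for name, metric in metrics.items()
    )
    return weighted_sum / total_weight


# Hypothetical metric set and weights, chosen only for illustration.
METRICS = {"sequence": sequence_similarity, "token_f1": token_f1}
WEIGHTS = {"sequence": 0.4, "token_f1": 0.6}

if __name__ == "__main__":
    reference_answer = "Recursive splitting keeps paragraphs and sentences intact."
    generated_answer = "Recursive splitting keeps sentences and paragraphs together."
    score = weighted_score(reference_answer, generated_answer, METRICS, WEIGHTS)
    print(f"weighted score: {score:.3f}")
```

Passing the metrics as a dictionary of callables keeps the combination step independent of any one metric, so a real BLEU, METEOR, or BERTScore function with the same `(reference, candidate)` signature could be added without changing `weighted_score`.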