DESIRE-ME: Domain-Enhanced Supervised Information Retrieval Using Mixture-of-Experts
Pranav Kasela, Gabriella Pasi, Raffaele Perego, Nicola Tonellotto
European Conference on Information Retrieval, pp. 111-125
Pub Date: 2024-03-20; DOI: 10.1007/978-3-031-56060-6_8
Improving the Robustness of Dense Retrievers Against Typos via Multi-Positive Contrastive Learning
Georgios Sidiropoulos, Evangelos Kanoulas
European Conference on Information Retrieval, pp. 297-305
Pub Date: 2024-03-16; DOI: 10.1007/978-3-031-56063-7_21
Measuring Bias in a Ranked List Using Term-Based Representations
Amin Abolghasemi, Leif Azzopardi, Arian Askari, Maarten de Rijke, Suzan Verberne
European Conference on Information Retrieval, pp. 3-19
Pub Date: 2024-03-09; DOI: 10.1007/978-3-031-56069-9_1
A Second Look on BASS - Boosting Abstractive Summarization with Unified Semantic Graphs - A Replication Study
Osman Alperen Koras, Jörg Schlötterer, Christin Seifert
European Conference on Information Retrieval, pp. 99-114
Pub Date: 2024-03-05; DOI: 10.1007/978-3-031-56066-8_11
Multimodal Learned Sparse Retrieval with Probabilistic Expansion Control
Thong Nguyen, Mariya Hendriksen, Andrew Yates, Maarten de Rijke
European Conference on Information Retrieval, pp. 448-464
Pub Date: 2024-02-27; DOI: 10.48550/arXiv.2402.17535

Learned sparse retrieval (LSR) is a family of neural methods that encode queries and documents into sparse lexical vectors that can be indexed and retrieved efficiently with an inverted index. We explore the application of LSR to the multimodal domain, with a focus on text-image retrieval. While LSR has seen success in text retrieval, its application in multimodal retrieval remains underexplored. Current approaches like LexLIP and STAIR require complex multi-step training on massive datasets. Our proposed approach efficiently transforms dense vectors from a frozen dense model into sparse lexical vectors. We address issues of high dimension co-activation and semantic deviation through a new training algorithm, using Bernoulli random variables to control query expansion. Experiments with two dense models (BLIP, ALBEF) and two datasets (MSCOCO, Flickr30k) show that our proposed algorithm effectively reduces co-activation and semantic deviation. Our best-performing sparsified model outperforms state-of-the-art text-image LSR models with a shorter training time and lower GPU memory requirements. Our approach offers an effective solution for training LSR retrieval models in multimodal settings. Our code and model checkpoints are available at github.com/thongnt99/lsr-multimodal
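The expansion-control idea in the abstract above can be pictured with a minimal sketch: a dense embedding is projected to a vocabulary-sized sparse vector, and during training each expansion dimension is gated by an independent Bernoulli draw, which bounds how many terms co-activate. All names, shapes, and the keep-probability value here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sparsify(dense_vec, proj, keep_prob, training=True):
    """Map a dense embedding to a vocabulary-sized sparse lexical vector.

    During training, each vocabulary dimension is kept with an independent
    Bernoulli draw, limiting how many terms co-activate; at inference the
    keep probability scales the weights deterministically instead.
    """
    logits = dense_vec @ proj                    # shape: (vocab_size,)
    weights = np.log1p(np.maximum(logits, 0.0))  # ReLU + log saturation
    if training:
        gate = rng.binomial(1, keep_prob, size=weights.shape)
    else:
        gate = keep_prob
    return weights * gate

vocab_size, dim = 1000, 64
dense = rng.normal(size=dim)                     # frozen dense-model embedding
proj = rng.normal(size=(dim, vocab_size)) * 0.1  # learned projection (toy values)
keep_prob = 0.3                                  # expansion-control parameter
sparse_train = sparsify(dense, proj, keep_prob, training=True)
sparse_infer = sparsify(dense, proj, keep_prob, training=False)
```

Because the Bernoulli gate zeroes a fraction of the active dimensions, the training-time vector is at least as sparse as the deterministic inference-time one.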
SoftQE: Learned Representations of Queries Expanded by LLMs
Varad Pimpalkhute, John Heyer, Xusen Yin, Sameer Gupta
European Conference on Information Retrieval, pp. 68-77
Pub Date: 2024-02-20; DOI: 10.48550/arXiv.2402.12663

We investigate the integration of Large Language Models (LLMs) into query encoders to improve dense retrieval without increasing latency and cost, by circumventing the dependency on LLMs at inference time. SoftQE incorporates knowledge from LLMs by mapping embeddings of input queries to those of the LLM-expanded queries. While improvements over various strong baselines on in-domain MS-MARCO metrics are marginal, SoftQE improves performance by 2.83 absolute percentage points on average on five out-of-domain BEIR tasks.
Can we predict QPP? An approach based on multivariate outliers
Adrian-Gabriel Chifu, Sébastien Déjean, Moncef Garouani, Josiane Mothe, Diégo Ortiz, Md. Zia Ullah
European Conference on Information Retrieval, pp. 458-467
Pub Date: 2024-02-07; DOI: 10.48550/arXiv.2402.16875

Query performance prediction (QPP) aims to forecast the effectiveness of a search engine across a range of queries and documents. While state-of-the-art predictors offer a certain level of precision, their accuracy is not flawless. Prior research has recognized the challenges inherent in QPP but often lacks a thorough qualitative analysis. In this paper, we delve into QPP by examining the factors that influence the predictability of query performance accuracy. We propose the working hypothesis that while some queries are readily predictable, others present significant challenges. By focusing on outliers, we aim to identify the queries that are particularly challenging to predict. To this end, we employ a multivariate outlier detection method. Our results demonstrate the effectiveness of this approach in identifying queries on which QPP methods do not perform well, yielding less reliable predictions. Moreover, we provide evidence that excluding these hard-to-predict queries from the analysis significantly enhances the overall accuracy of QPP.
FakeClaim: A Multiple Platform-driven Dataset for Identification of Fake News on 2023 Israel-Hamas War
Gautam Kishore Shahi, Amit Kumar Jaiswal, Thomas Mandl
European Conference on Information Retrieval, pp. 66-74
Pub Date: 2024-01-29; DOI: 10.48550/arXiv.2401.16625

We contribute the first publicly available dataset of factual claims from different platforms and fake YouTube videos on the 2023 Israel-Hamas war for automatic fake YouTube video classification. The FakeClaim data is collected from 60 fact-checking organizations in 30 languages and enriched with metadata from the fact-checking organizations curated by trained journalists specialized in fact-checking. Further, we classify fake videos within the subset of YouTube videos using textual information and user comments. We used a pre-trained model to classify each video with different feature combinations. Our best-performing fine-tuned language model, Universal Sentence Encoder (USE), achieves a Macro F1 of 87%, which shows that the trained model can be helpful for debunking fake videos using the comments from the user discussion. The dataset is available on GitHub: https://github.com/Gautamshahi/FakeClaim
A Cost-Sensitive Meta-Learning Strategy for Fair Provider Exposure in Recommendation
Ludovico Boratto, Giulia Cerniglia, Mirko Marras, Alessandra Perniciano, Barbara Pes
European Conference on Information Retrieval, pp. 440-448
Pub Date: 2024-01-24; DOI: 10.48550/arXiv.2401.13566

When devising recommendation services, it is important to account for the interests of all content providers, encompassing not only newcomers but also minority demographic groups. In various instances, certain provider groups find themselves underrepresented in the item catalog, a situation that can influence recommendation results. Hence, platform owners often seek to regulate the exposure of these provider groups in the recommended lists. In this paper, we propose a novel cost-sensitive approach designed to guarantee these target exposure levels in pairwise recommendation models. This approach quantifies, and consequently mitigates, the discrepancies between the volume of recommendations allocated to groups and their contribution to the item catalog, under the principle of equity. Our results show that this approach, while aligning groups' exposure with their assigned levels, does not compromise the original recommendation utility. Source code and pre-processed data can be retrieved at https://github.com/alessandraperniciano/meta-learning-strategy-fair-provider-exposure.
Estimating the Usefulness of Clarifying Questions and Answers for Conversational Search
Ivan Sekulić, Weronika Łajewska, Krisztian Balog, Fabio Crestani
European Conference on Information Retrieval, pp. 384-392
Pub Date: 2024-01-21; DOI: 10.48550/arXiv.2401.11463

While the body of research directed towards constructing and generating clarifying questions in mixed-initiative conversational search systems is vast, research aimed at processing and comprehending users' answers to such questions is scarce. To this end, we present a simple yet effective method for processing answers to clarifying questions, moving away from previous work that simply appends answers to the original query and thus potentially degrades retrieval performance. Specifically, we propose a classifier for assessing the usefulness of the prompted clarifying question and the answer given by the user. Useful questions and answers are further appended to the conversation history and passed to a transformer-based query rewriting module. Results demonstrate significant improvements over strong non-mixed-initiative baselines. Furthermore, the proposed approach mitigates the performance drops when non-useful questions and answers are utilized.