
arXiv - CS - Information Retrieval: Latest Papers

Perceptions of Edinburgh: Capturing Neighbourhood Characteristics by Clustering Geoparsed Local News
Pub Date : 2024-09-17 DOI: arxiv-2409.11505
Andreas Grivas, Claire Grover, Richard Tobin, Clare Llewellyn, Eleojo Oluwaseun Abubakar, Chunyu Zheng, Chris Dibben, Alan Marshall, Jamie Pearce, Beatrice Alex
The communities that we live in affect our health in ways that are complex and hard to define. Moreover, our understanding of the place-based processes affecting health and inequalities is limited. This undermines the development of robust policy interventions to improve local health and well-being. News media provides social and community information that may be useful in health studies. Here we propose a methodology for characterising neighbourhoods by using local news articles. More specifically, we show how we can use Natural Language Processing (NLP) to unlock further information about neighbourhoods by analysing, geoparsing and clustering news articles. Our work is novel because we combine street-level geoparsing tailored to the locality with clustering of full news articles, enabling a more detailed examination of neighbourhood characteristics. We evaluate our outputs and show via a confluence of evidence, both from a qualitative and a quantitative perspective, that the themes we extract from news articles are sensible and reflect many characteristics of the real world. This is significant because it allows us to better understand the effects of neighbourhoods on health. Our findings on neighbourhood characterisation using news data will support a new generation of place-based research which examines a wider set of spatial processes and how they affect health, enabling new epidemiological research.
Citations: 0
TISIS: Trajectory Indexing for SImilarity Search
Pub Date : 2024-09-17 DOI: arxiv-2409.11301
Sara Jarrad, Hubert Naacke, Stephane Gancarski
Social media platforms enable users to share diverse types of information, including geolocation data that captures their movement patterns. Such geolocation data can be leveraged to reconstruct the trajectory of a user's visited Points of Interest (POIs). A key requirement in numerous applications is the ability to measure the similarity between such trajectories, as this facilitates the retrieval of trajectories that are similar to a given reference trajectory. This is the main focus of our work. Existing methods predominantly rely on applying a similarity function to each candidate trajectory to identify those that are sufficiently similar. However, this approach becomes computationally expensive when dealing with large-scale datasets. To mitigate this challenge, we propose TISIS, an efficient method that uses trajectory indexing to quickly find similar trajectories that share common POIs in the same order. Furthermore, to account for scenarios where POIs in trajectories may not exactly match but are contextually similar, we introduce TISIS*, a variant of TISIS that incorporates POI embeddings. This extension allows for more comprehensive retrieval of similar trajectories by considering semantic similarities between POIs, beyond mere exact matches. Extensive experimental evaluations demonstrate that the proposed approach significantly outperforms a baseline method based on the well-known Longest Common SubSequence (LCSS) algorithm, yielding substantial performance improvements across various real-world datasets.
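The LCSS baseline mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the normalisation by the shorter trajectory, and the example POI labels are all assumptions made for illustration. It shows the candidate-by-candidate scan that the abstract describes as expensive on large datasets.

```python
from typing import List, Sequence


def lcss_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two POI sequences."""
    prev = [0] * (len(b) + 1)  # DP row for the previous element of `a`
    for poi_a in a:
        curr = [0]
        for j, poi_b in enumerate(b, start=1):
            if poi_a == poi_b:
                curr.append(prev[j - 1] + 1)  # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcss_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """LCSS length normalised by the shorter trajectory (an assumed choice)."""
    if not a or not b:
        return 0.0
    return lcss_length(a, b) / min(len(a), len(b))


def brute_force_search(query: Sequence[str],
                       candidates: List[Sequence[str]],
                       threshold: float) -> List[int]:
    """Scan every candidate; cost grows linearly with the dataset size."""
    return [i for i, traj in enumerate(candidates)
            if lcss_similarity(query, traj) >= threshold]


# Hypothetical POI labels, for illustration only.
query = ["cafe", "museum", "park", "hotel"]
dataset = [["cafe", "park", "bar", "hotel"], ["gym", "office"]]
matches = brute_force_search(query, dataset, threshold=0.7)
```

Each comparison costs time proportional to the product of the two trajectory lengths, and the scan repeats it for every candidate, which is the cost an index over shared POIs is meant to avoid.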
Citations: 0
Trustworthiness in Retrieval-Augmented Generation Systems: A Survey
Pub Date : 2024-09-16 DOI: arxiv-2409.10102
Yujia Zhou, Yan Liu, Xiaoxi Li, Jiajie Jin, Hongjin Qian, Zheng Liu, Chaozhuo Li, Zhicheng Dou, Tsung-Yi Ho, Philip S. Yu
Retrieval-Augmented Generation (RAG) has quickly grown into a pivotal paradigm in the development of Large Language Models (LLMs). While much of the current research in this field focuses on performance optimization, particularly in terms of accuracy and efficiency, the trustworthiness of RAG systems remains underexplored. On the positive side, RAG systems promise to enhance LLMs by providing them with useful and up-to-date knowledge from vast external databases, thereby mitigating the long-standing problem of hallucination. On the negative side, RAG systems risk generating undesirable content if the retrieved information is either inappropriate or poorly utilized. To address these concerns, we propose a unified framework that assesses the trustworthiness of RAG systems across six key dimensions: factuality, robustness, fairness, transparency, accountability, and privacy. Within this framework, we thoroughly review the existing literature on each dimension. Additionally, we create an evaluation benchmark covering the six dimensions and conduct comprehensive evaluations of a variety of proprietary and open-source models. Finally, we identify potential challenges for future research based on our investigation results. Through this work, we aim to lay a structured foundation for future investigations and provide practical insights for enhancing the trustworthiness of RAG systems in real-world applications.
Citations: 0
jina-embeddings-v3: Multilingual Embeddings With Task LoRA
Pub Date : 2024-09-16 DOI: arxiv-2409.10173
Saba Sturua, Isabelle Mohr, Mohammad Kalim Akram, Michael Günther, Bo Wang, Markus Krimmel, Feng Wang, Georgios Mastrapas, Andreas Koukounas, Nan Wang, Han Xiao
We introduce jina-embeddings-v3, a novel text embedding model with 570 million parameters that achieves state-of-the-art performance on multilingual data and long-context retrieval tasks, supporting context lengths of up to 8192 tokens. The model includes a set of task-specific Low-Rank Adaptation (LoRA) adapters to generate high-quality embeddings for query-document retrieval, clustering, classification, and text matching. Additionally, Matryoshka Representation Learning is integrated into the training process, allowing flexible truncation of embedding dimensions without compromising performance. Evaluation on the MTEB benchmark shows that jina-embeddings-v3 outperforms the latest proprietary embeddings from OpenAI and Cohere on English tasks, while achieving superior performance compared to multilingual-e5-large-instruct across all multilingual tasks.
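The truncation of embedding dimensions mentioned in this abstract can be sketched on the consumer side in a few lines of Python. This is only an illustration under stated assumptions: the helper name `truncate_embedding` is hypothetical, the vectors are toy values, and re-normalising after truncation is a common convention rather than something the abstract specifies.

```python
import math
from typing import List, Sequence


def truncate_embedding(vec: Sequence[float], dim: int) -> List[float]:
    """Keep the first `dim` components and rescale to unit length.

    Assumes the model was trained so that leading components carry the
    most information (the Matryoshka property); this helper is hypothetical.
    """
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale
    return [x / norm for x in head]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Toy 3-dimensional "embeddings", truncated to 2 dimensions.
doc = truncate_embedding([3.0, 4.0, 12.0], dim=2)
query = truncate_embedding([4.0, 3.0, 1.0], dim=2)
score = dot(doc, query)
```

Shorter vectors reduce storage and comparison cost roughly in proportion to `dim`; how much retrieval quality is retained depends on the model and is not something this sketch can show.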
Citations: 0
beeFormer: Bridging the Gap Between Semantic and Interaction Similarity in Recommender Systems
Pub Date : 2024-09-16 DOI: arxiv-2409.10309
Vojtěch Vančura, Pavel Kordík, Milan Straka
Recommender systems often use text-side information to improve their predictions, especially in cold-start or zero-shot recommendation scenarios, where traditional collaborative filtering approaches cannot be used. Many approaches to text-mining side information for recommender systems have been proposed over recent years, with sentence Transformers being the most prominent one. However, these models are trained to predict semantic similarity without utilizing interaction data with hidden patterns specific to recommender systems. In this paper, we propose beeFormer, a framework for training sentence Transformer models with interaction data. We demonstrate that our models trained with beeFormer can transfer knowledge between datasets while outperforming not only semantic similarity sentence Transformers but also traditional collaborative filtering methods. We also show that training on multiple datasets from different domains accumulates knowledge in a single model, unlocking the possibility of training universal, domain-agnostic sentence Transformer models to mine text representations for recommender systems. We release the source code, trained models, and additional details allowing replication of our experiments at https://github.com/recombee/beeformer.
Citations: 0
Large Language Model Enhanced Hard Sample Identification for Denoising Recommendation
Pub Date : 2024-09-16 DOI: arxiv-2409.10343
Tianrui Song, Wenshuo Chao, Hao Liu
Implicit feedback, often used to build recommender systems, unavoidably confronts noise due to factors such as misclicks and position bias. Previous studies have attempted to alleviate this by identifying noisy samples based on their diverged patterns, such as higher loss values, and mitigating the noise through sample dropping or reweighting. Despite the progress, we observe that existing approaches struggle to distinguish hard samples from noisy samples, as they often exhibit similar patterns, thereby limiting their effectiveness in denoising recommendations. To address this challenge, we propose a Large Language Model Enhanced Hard Sample Denoising (LLMHD) framework. Specifically, we construct an LLM-based scorer to evaluate the semantic consistency of items with the user preference, which is quantified based on summarized historical user interactions. The resulting scores are used to assess the hardness of samples for the pointwise or pairwise training objectives. To ensure efficiency, we introduce a variance-based sample pruning strategy to filter potential hard samples before scoring. In addition, we propose an iterative preference update module designed to continuously refine the summarized user preference, which may be biased due to false-positive user-item interactions. Extensive experiments on three real-world datasets and four backbone recommenders demonstrate the effectiveness of our approach.
Citations: 0
Practical and Asymptotically Optimal Quantization of High-Dimensional Vectors in Euclidean Space for Approximate Nearest Neighbor Search
Pub Date : 2024-09-16 DOI: arxiv-2409.09913
Jianyang Gao, Yutong Gou, Yuexuan Xu, Yongyi Yang, Cheng Long, Raymond Chi-Wing Wong
Approximate nearest neighbor (ANN) query in high-dimensional Euclidean space is a key operator in database systems. For this query, quantization is a popular family of methods developed for compressing vectors and reducing memory consumption. Recently, a method called RaBitQ has achieved state-of-the-art performance among these methods. It produces better empirical performance in both accuracy and efficiency when using the same compression rate and provides rigorous theoretical guarantees. However, the method is only designed for compressing vectors at high compression rates (32x) and lacks support for achieving higher accuracy by using more space. In this paper, we introduce a new quantization method to address this limitation by extending RaBitQ. The new method inherits the theoretical guarantees of RaBitQ and achieves asymptotic optimality in terms of the trade-off between space and error bounds, as proven in this study. Additionally, we present efficient implementations of the method, enabling its application to ANN queries to reduce both space and time consumption. Extensive experiments on real-world datasets confirm that our method consistently outperforms the state-of-the-art baselines in both accuracy and efficiency when using the same amount of memory.
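To make the 32x figure in this abstract concrete, the sketch below uses plain sign-bit quantization: one bit per dimension instead of a 32-bit float. This is deliberately not RaBitQ or the extension proposed in the paper, which the abstract does not describe in enough detail to reproduce; all function names are hypothetical and the code only illustrates the space versus accuracy trade-off in its simplest form.

```python
from typing import Sequence


def sign_quantize(vec: Sequence[float]) -> int:
    """Pack one bit per dimension (1 if the component is >= 0) into an int."""
    code = 0
    for i, x in enumerate(vec):
        if x >= 0:
            code |= 1 << i
    return code


def hamming(code_a: int, code_b: int) -> int:
    """Number of dimensions whose sign bits differ; a crude distance proxy."""
    return bin(code_a ^ code_b).count("1")


def compression_ratio(bits_per_dim: int, float_bits: int = 32) -> float:
    """Space saving relative to storing each dimension as a float."""
    return float_bits / bits_per_dim


a = sign_quantize([0.5, -1.0, 2.0, -0.1])
b = sign_quantize([0.5, 1.0, 2.0, -0.1])
distance = hamming(a, b)
ratio_1bit = compression_ratio(1)   # the 32x regime
ratio_4bit = compression_ratio(4)   # more space per dimension, lower ratio
```

Spending more bits per dimension lowers the compression ratio in exchange for a finer approximation, which is the regime the abstract says the original method did not support.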
Citations: 0
Incorporating Classifier-Free Guidance in Diffusion Model-Based Recommendation
Pub Date : 2024-09-16 DOI: arxiv-2409.10494
Noah Buchanan, Susan Gauch, Quan Mai
This paper presents a diffusion-based recommender system that incorporates classifier-free guidance. Most current recommender systems provide recommendations using conventional methods such as collaborative or content-based filtering. Diffusion is a new approach to generative AI that improves on previous generative AI approaches such as Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs). We incorporate diffusion in a recommender system that mirrors the sequence users take when browsing and rating items. Although a few current recommender systems incorporate diffusion, they do not incorporate classifier-free guidance, a recent innovation in diffusion models. In this paper, we present a diffusion recommender system that augments the underlying recommender system model for improved performance and also incorporates classifier-free guidance. Our findings show improvements over state-of-the-art recommender systems for most metrics for several recommendation tasks on a variety of datasets. In particular, our approach demonstrates the potential to provide better recommendations when data is sparse.
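Classifier-free guidance, as commonly formulated for diffusion models, blends a conditional and an unconditional prediction at each denoising step. The Python sketch below shows only that blending step, using one widely used parameterisation; the abstract does not state which form or guidance scale the authors use, so the function name, the `scale` parameter, and the toy values are assumptions.

```python
from typing import List, Sequence


def cfg_combine(cond: Sequence[float],
                uncond: Sequence[float],
                scale: float) -> List[float]:
    """Guided prediction: uncond + scale * (cond - uncond).

    scale = 0 gives the unconditional prediction, scale = 1 the conditional
    one, and scale > 1 pushes further in the conditional direction.
    """
    if len(cond) != len(uncond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# Toy noise predictions for a 2-dimensional state (illustrative values only).
eps_cond = [1.0, 2.0]      # prediction given the conditioning signal
eps_uncond = [0.0, 1.0]    # prediction with the conditioning dropped
guided = cfg_combine(eps_cond, eps_uncond, scale=2.0)
```

In training, the conditioning signal is typically dropped at random so a single model can produce both predictions; in a recommender the conditioning would plausibly be user history, though that is an assumption rather than something the abstract confirms.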
Citations: 0
Online Learning via Memory: Retrieval-Augmented Detector Adaptation
Pub Date : 2024-09-16 DOI: arxiv-2409.10716
Yanan Jian, Fuxun Yu, Qi Zhang, William Levine, Brandon Dubbs, Nikolaos Karianakis
This paper presents a novel way of adapting any off-the-shelf object detection model online to a novel domain without retraining the detector model. Inspired by how humans quickly learn knowledge of a new subject (e.g., memorization), we allow the detector to look up similar object concepts from memory during test time. This is achieved through a retrieval augmented classification (RAC) module together with a memory bank that can be flexibly updated with new domain knowledge. We experimented with various off-the-shelf open-set and closed-set detectors. With only a tiny memory bank (e.g., 10 images per category) and being training-free, our online learning method could significantly outperform baselines in adapting a detector to novel domains.
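The lookup step described in the abstract can be pictured with a minimal sketch. It assumes the detector yields a feature vector for each detected region and that the memory bank stores labelled feature vectors. The `MemoryBank` class, its method names, cosine similarity, and k-nearest-neighbour voting are illustrative assumptions, not the authors' RAC module.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class MemoryBank:
    """Hypothetical store of (feature, label) pairs; not the paper's API."""

    def __init__(self):
        self.entries = []

    def add(self, feature, label):
        # New domain knowledge is appended; no model weights are changed.
        self.entries.append((list(feature), label))

    def classify(self, query, k=3):
        # Rank stored entries by similarity to the query, then vote among the top k.
        ranked = sorted(self.entries, key=lambda e: cosine(query, e[0]), reverse=True)
        votes = Counter(label for _, label in ranked[:k])
        return votes.most_common(1)[0][0]


# Toy two-dimensional features and made-up labels, for illustration only.
bank = MemoryBank()
bank.add([1.0, 0.1], "forklift")
bank.add([0.9, 0.2], "forklift")
bank.add([0.1, 1.0], "pallet")

predicted = bank.classify([0.95, 0.15], k=3)
```

Because adapting to a new domain here means only calling `add`, the sketch stays training-free in the same sense as the abstract; how the actual method extracts features and combines retrieval with the detector's own scores is not stated in the abstract.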
Citations: 0
Enhancing Personalized Recipe Recommendation Through Multi-Class Classification
Pub Date : 2024-09-16 DOI: arxiv-2409.10267
Harish Neelam, Koushik Sai Veerella
This paper intends to address the challenge of personalized recipe recommendation in the realm of diverse culinary preferences. The problem domain involves recipe recommendations, utilizing techniques such as association analysis and classification. Association analysis explores the relationships and connections between different ingredients to enhance the user experience. Meanwhile, the classification aspect involves categorizing recipes based on user-defined ingredients and preferences. A unique aspect of the paper is the consideration of recipes and ingredients belonging to multiple classes, recognizing the complexity of culinary combinations. This necessitates a sophisticated approach to classification and recommendation, ensuring the system accommodates the nature of recipe categorization. The paper seeks not only to recommend recipes but also to explore the process involved in achieving accurate and personalized recommendations.
Citations: 0