
Information Retrieval Journal: Latest Articles

Short-term POI recommendation with personalized time-weighted latent ranking
IF 2.5, CAS Tier 3 (Computer Science), Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS. Pub Date: 2024-07-03. DOI: 10.1007/s10791-024-09450-9
Yufeng Zou, Kaiqi Zhao

In this paper, we formulate a novel Point-of-Interest (POI) recommendation task, named short-term POI recommendation, which recommends a set of new POIs to visit within a short period following recent check-ins. It differs from previously studied tasks and poses new challenges, such as modeling high-order POI transitions within a short period. We present PTWLR, a personalized time-weighted latent ranking model that jointly learns short-term POI transitions and user preferences, with a proposed temporal weighting scheme to capture the temporal context of transitions. We extend the model to accommodate transition dependencies on multiple recent check-ins. In experiments on real-world datasets, our model consistently outperforms seven widely used methods by significant margins in various contexts, demonstrating its effectiveness on our task. Further analysis shows that all proposed components contribute to the performance improvement.
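The core scoring idea the abstract describes, weighting learned POI-to-POI transition scores by the recency of each check-in, can be sketched as follows. This is an illustrative assumption, not the paper's actual PTWLR formulation: the exponential decay, the `tau` parameter, and all function names here are invented for the sketch.

```python
import math

def time_weight(delta_hours, tau=6.0):
    # Hypothetical exponential decay: more recent check-ins weigh more.
    return math.exp(-delta_hours / tau)

def score_candidates(recent, transition, now, candidates):
    """Rank candidate POIs by a time-weighted sum of transition scores.

    recent:     list of (poi, timestamp_in_hours) for recent check-ins.
    transition: dict mapping (from_poi, to_poi) -> learned transition score.
    now:        current time in hours.
    """
    scores = {}
    for c in candidates:
        s = 0.0
        for poi, t in recent:
            s += time_weight(now - t) * transition.get((poi, c), 0.0)
        scores[c] = s
    # Highest time-weighted score first.
    return sorted(candidates, key=lambda c: -scores[c])
```

With a single recent check-in, the ranking simply follows the transition scores out of that POI, discounted identically; with several check-ins, older ones contribute less.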

Citations: 0
IDaTPA: importance degree based thread partitioning approach in thread level speculation
IF 2.5, CAS Tier 3 (Computer Science), Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS. Pub Date: 2024-06-19. DOI: 10.1007/s10791-024-09440-x
Li Yuxiang, Zhang Zhiyong, Wang Xinyong, Huang Shuaina, Su Yaning

Thread-Level Speculation (TLS), also called Speculative Multithreading (SpMT), is a thread-level auto-parallelization technique for multi-core processors: it partitions programs into multiple threads and speculatively executes them under ambiguous data and control dependences. The thread partitioning approach plays a key role in the performance of TLS. The existing heuristic-rules-based approach (HR-based approach) is a one-size-fits-all strategy and cannot guarantee a satisfactory thread partitioning. In this paper, an importance-degree-based thread partitioning approach (IDaTPA) is proposed to partition irregular programs into multiple threads. IDaTPA performs a biased partitioning for every procedure with a machine learning method, mainly comprising sample set construction, knowledge representation, similarity calculation, and a prediction model; the partitioning of irregular programs is then performed by the prediction model. Using IDaTPA, the subprocedures in unseen irregular programs can obtain a satisfactory partition. Evaluated on a generic SpMT processor (called Prophet) for multithreaded programs, IDaTPA delivers an average speedup of 1.80 on a 4-core processor. Furthermore, to evaluate its portability, we port IDaTPA to an 8-core processor and obtain an average speedup of 2.82. Experimental results show that IDaTPA achieves a significant speedup increase: compared with the conventional HR-based approach, the Olden benchmarks deliver a 5.75% performance improvement on 4 cores and 6.32% on 8 cores, and the SPEC2020 benchmarks obtain a 38.20% performance improvement.
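The similarity-based prediction step the abstract outlines, choosing a partition for an unseen procedure by comparing its features against known samples, might look like the following nearest-neighbour sketch. The feature representation, cosine similarity, and 1-NN choice are assumptions for illustration, not IDaTPA's published details.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def predict_partition(features, samples):
    """Predict partition parameters for an unseen procedure.

    samples: list of (feature_vector, partition_params) pairs collected
    from already-partitioned procedures. Returns the params of the most
    similar sample, i.e. a 1-nearest-neighbour prediction.
    """
    return max(samples, key=lambda s: cosine(features, s[0]))[1]
```

The idea is that procedures with similar static features (e.g. loop counts, call depth) are assumed to benefit from similar partitioning decisions.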

Citations: 0
A Geth-based real-time detection system for sandwich attacks in Ethereum
IF 2.5, CAS Tier 3 (Computer Science), Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS. Pub Date: 2024-05-30. DOI: 10.1007/s10791-024-09445-6
Dongze Li, Kejia Zhang, Lei Wang, Gang Du

With the rapid development of the Ethereum ecosystem and the growing adoption of decentralized finance (DeFi), security research on smart contracts and blockchain transactions has attracted increasing attention. In particular, front-running attacks on the Ethereum platform have become a major security concern. These attack strategies exploit the transparency and determinism of the blockchain, enabling attackers to gain unfair economic benefits by manipulating the transaction order. This study proposes a sandwich attack detection system integrated into the go-Ethereum client (Geth). By analyzing transaction data streams, the system effectively detects and defends against front-running and sandwich attacks. It performs real-time analysis of transactions within blocks, quickly and effectively identifying abnormal patterns and potential attack behaviors. The system has been optimized for performance, with an average processing time of 0.442 s per block and an accuracy rate of 83%. The response time for real-time detection of new blocks is within 5 s, with the majority occurring between 1 and 2 s, which is considered acceptable. The findings indicate that, as a part of the go-Ethereum client, this detection system helps enhance the security of the Ethereum blockchain, contributing to the protection of DeFi users' private funds and the safety of smart contracts. The primary contribution of this study lies in offering an efficient blockchain transaction monitoring system, capable of accurately detecting sandwich attack transactions within blocks while maintaining normal operation speed as a full node.
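The canonical sandwich shape that such a detector looks for is an attacker's buy, a victim's trade in the same pool, and the attacker's sell, in that order within one block. The sketch below flags that pattern over a simplified transaction representation; it is a toy heuristic with invented field names, not the paper's actual detector, which operates on real Geth transaction streams.

```python
def find_sandwiches(block_txs):
    """Flag candidate sandwich patterns in an ordered list of transactions.

    block_txs: list of dicts with 'sender', 'pool', and 'side' ('buy'/'sell'),
    in block order. Returns index triples (i, j, k) where the same sender
    buys before and sells after another sender's trade in the same pool.
    """
    hits = []
    n = len(block_txs)
    for i in range(n):
        front = block_txs[i]
        if front['side'] != 'buy':
            continue
        for j in range(i + 1, n):
            victim = block_txs[j]
            # The victim trades the same pool but is a different sender.
            if victim['pool'] != front['pool'] or victim['sender'] == front['sender']:
                continue
            for k in range(j + 1, n):
                back = block_txs[k]
                if (back['sender'] == front['sender']
                        and back['pool'] == front['pool']
                        and back['side'] == 'sell'):
                    hits.append((i, j, k))
    return hits
```

A production detector would additionally check amounts, gas prices, and slippage to cut false positives; this sketch only captures the ordering constraint.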

Citations: 0
Advances in information retrieval collection on the European conference on information retrieval 2023
IF 2.5, CAS Tier 3 (Computer Science), Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS. Pub Date: 2024-05-23. DOI: 10.1007/s10791-024-09442-9
Jaap Kamps, Lorraine Goeuriot, Fabio Crestani

This paper introduces the Collection on ECIR 2023. The 45th European Conference on Information Retrieval (ECIR 2023) was held in Dublin, Ireland, during April 2–6, 2023. The conference was the largest ECIR ever, and brought together hundreds of researchers from Europe and abroad. The authors of a selection of papers shortlisted for the best paper awards were invited to submit expanded versions, which appear in this Discover Computing (formerly the Information Retrieval Journal) Collection on ECIR 2023. The first is an analytic paper on incorporating first-stage retrieval status values as input to neural cross-encoder re-rankers. The second contributes new models and new data for the new task of temporal natural language inference. The third presents a weak supervision approach to video retrieval that overcomes the need for large-scale human-labeled training data. Together, these papers showcase the breadth and diversity of current research on information retrieval.

Citations: 0
Temporal validity reassessment: commonsense reasoning about information obsoleteness
IF 2.5, CAS Tier 3 (Computer Science), Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS. Pub Date: 2024-05-06. DOI: 10.1007/s10791-024-09433-w
Taishi Hosokawa, Adam Jatowt, Kazunari Sugiyama

It is useful for machines to know whether text information remains valid, for various applications including text comprehension, story understanding, temporal information retrieval, and user state tracking on microblogs as well as in chatbot conversations. This kind of inference is still difficult for current models, including large language models, as it requires temporal commonsense knowledge and reasoning. In this paper, inspired by traditional natural language reasoning, we approach the task of Temporal Validity Reassessment: determining updates to the temporal validity of text content. The task requires judging whether actions expressed in a sentence are still ongoing or already completed, and hence whether the sentence remains valid or has become obsolete, given context in the form of supplementary content such as a follow-up sentence. We first construct our own dataset for this task and train several machine learning models. We then propose an effective method for learning from an external knowledge base that provides temporal commonsense knowledge. Using our dataset, we introduce a machine learning model that incorporates the information from the knowledge base and demonstrate that incorporating external knowledge generally improves the results. We also experiment with different embedding types for representing temporal commonsense knowledge, as well as with data augmentation methods to increase the size of our dataset.

Citations: 0
Bcubed revisited: elements like me
IF 2.5, CAS Tier 3 (Computer Science), Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS. Pub Date: 2024-05-06. DOI: 10.1007/s10791-024-09436-7
Ruben van Heusden, Jaap Kamps, Maarten Marx

BCubed is a mathematically clean, elegant, and intuitively well-behaved external performance metric for clustering tasks. BCubed compares a predicted clustering to a known ground-truth clustering through element-wise precision and recall scores. For each element, the predicted and ground-truth clusters containing the element are compared, and the mean over all elements is taken. We argue that BCubed overestimates performance, for the intuitive reason that the clustering gets credit for putting an element into its own cluster. We repair this and investigate the repaired version, called "Elements Like Me (ELM)". We extensively evaluate ELM from both a theoretical and an empirical perspective, and conclude that it retains all of BCubed's positive properties while yielding a minimum score of zero when it should. Synthetic experiments show that ELM can rank predicted clusterings differently than BCubed, and that ELM scores are distributed with a lower mean and a larger variance than BCubed scores.
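The element-wise scores are simple to compute. The sketch below implements standard BCubed precision and recall as described above; note that the cluster overlap always counts the element itself, which is exactly the self-credit the abstract objects to. ELM's repair is not reproduced here.

```python
from collections import defaultdict

def bcubed(pred, gold):
    """Standard BCubed precision and recall.

    pred, gold: dicts mapping element -> cluster label. For each element e,
    precision is the fraction of e's predicted cluster that shares e's gold
    cluster; recall is the symmetric quantity; both are averaged over all
    elements. The overlap always includes e itself, so every element
    contributes a strictly positive amount.
    """
    elems = list(gold)
    pred_clusters = defaultdict(set)
    gold_clusters = defaultdict(set)
    for e in elems:
        pred_clusters[pred[e]].add(e)
        gold_clusters[gold[e]].add(e)
    p = r = 0.0
    for e in elems:
        pc, gc = pred_clusters[pred[e]], gold_clusters[gold[e]]
        overlap = len(pc & gc)
        p += overlap / len(pc)
        r += overlap / len(gc)
    n = len(elems)
    return p / n, r / n
```

A perfect clustering scores (1.0, 1.0); because of the self-match, even a clustering of all singletons gets precision 1.0, never 0.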

Citations: 0
A model of the relationship between the variations of effectiveness and fairness in information retrieval
IF 2.5, CAS Tier 3 (Computer Science), Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS. Pub Date: 2024-04-23. DOI: 10.1007/s10791-024-09434-9
Massimo Melucci

The requirement that, for fair document retrieval, documents should be ranked so as to equally expose authors and organizations has been studied for some years. The fair exposure of a ranking, however, undermines the optimality of the Probability Ranking Principle and, as a consequence, retrieval effectiveness. We show how the variations of fairness and effectiveness can be related by a model. To this end, the paper introduces a fairness measure inspired by Gini's index of mutability for non-ordinal variables and relates it to a sufficiently general measure of effectiveness, thus modeling the connection between these two dimensions of Information Retrieval. The paper also introduces a measure of the statistical significance of the fairness measure. An empirical study completes the paper.
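For a non-ordinal (categorical) variable, Gini's index of mutability is 1 − Σ pᵢ². Applied to the share of exposure each author receives in a ranking, it is 0 when one author takes all exposure and approaches 1 as exposure spreads evenly. The sketch below illustrates this idea; the logarithmic position discount (as in DCG) and function names are assumptions, not the paper's exact measure.

```python
import math

def exposure_by_author(ranking):
    """Total exposure per author for an ordered list of author labels,
    one per retrieved document, using a hypothetical 1/log2(rank+1)
    position discount."""
    exp = {}
    for rank, author in enumerate(ranking, start=1):
        exp[author] = exp.get(author, 0.0) + 1.0 / math.log2(rank + 1)
    return exp

def gini_mutability(exposures):
    # 1 - sum(p_i^2): 0 when one author gets all exposure, growing
    # toward 1 as exposure spreads evenly over many authors.
    total = sum(exposures.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((v / total) ** 2 for v in exposures.values())
```

Re-ranking to raise this score typically demotes some highly relevant documents, which is the fairness/effectiveness tension the model in the paper captures.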

Citations: 0
Boolean interpretation, matching, and ranking of natural language queries in product selection systems
IF 2.5, CAS Tier 3 (Computer Science), Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS. Pub Date: 2024-04-03. DOI: 10.1007/s10791-024-09432-x
Matthew Moulton, Yiu-Kai Ng

Abstract

E-commerce is a massive sector in the US economy, generating $767.7 billion in revenue in 2021. E-commerce sites maximize their revenue by helping customers find, examine, and purchase products. To help users easily find the products in the database most relevant to their individual needs, e-commerce sites are equipped with a product retrieval system. Many of these modern retrieval systems parse user-specified constraints or keywords embedded in a simple natural language query, which is generally easier and faster for the customer than navigating a product specification form, and does not require the seller to design or develop such a form. These natural language product retrieval systems, however, suffer from low relevance in retrieved products, especially for complex constraints on products. The reduced accuracy is partly due to under-utilizing the rich semantics of natural language, specifically queries that include Boolean operators, and to the lack of ranking for partially-matched relevant results that could interest customers. This undesirable effect causes e-commerce vendors to lose sales on their merchandise. To solve this problem, we propose a novel product retrieval system, called QuePR, that parses arbitrarily simple and complex natural language queries with or without Boolean operators, utilizes combinatorial numeric and content-based matching to extract relevant products from a database, and ranks the retrieved products by relevance before presenting them to the end-user. The advantages of QuePR are its ability to process explicit and implicit Boolean operators in queries, handle natural language queries using similarity measures on partially-matched records, and perform a best guess or match on ambiguous or incomplete queries. QuePR is unique, easy to use, and scalable to all product categories. To verify the accuracy of QuePR in retrieving relevant products across different product domains, we have conducted various performance analyses and compared QuePR with other ranking and retrieval systems. The empirical results verify that QuePR outperforms the others while maintaining optimal runtime speed.
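Once a natural language query has been interpreted into Boolean form, matching a product against it is a straightforward recursive evaluation. The nested-tuple query representation below is a hypothetical post-parse form invented for this sketch, not QuePR's internal representation, and it omits QuePR's partial-match similarity ranking.

```python
import operator

# Comparison operators allowed in query leaves.
OPS = {'<': operator.lt, '>': operator.gt, '=': operator.eq}

def matches(product, query):
    """Evaluate a Boolean query tree against one product.

    product: dict of attribute values.
    query:   ('and'|'or'|'not', subquery, ...) nodes over
             (attribute, op, value) leaves. A missing attribute
             fails its leaf.
    """
    head = query[0]
    if head == 'and':
        return all(matches(product, q) for q in query[1:])
    if head == 'or':
        return any(matches(product, q) for q in query[1:])
    if head == 'not':
        return not matches(product, query[1])
    attr, op, value = query  # leaf condition
    return attr in product and OPS[op](product[attr], value)
```

For example, the query "laptops under $1000 with at least 8 GB RAM" could parse into `('and', ('price', '<', 1000), ('or', ('ram', '>', 8), ('ram', '=', 8)))`.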

Citations: 0
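The partial-match ranking idea described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual QuePR matcher: queries are assumed to be in disjunctive normal form (OR-separated clauses of AND-separated keywords), and `Product`, `score`, and `retrieve` are hypothetical names introduced here for the sketch.

```python
from dataclasses import dataclass

@dataclass
class Product:
    name: str
    description: str

def score(query: str, product: Product) -> float:
    """Score a product against a query of OR-separated clauses of
    AND-separated keywords. A clause's score is the fraction of its
    keywords found in the product text, so partially matched products
    still receive a rank instead of being dropped outright."""
    text = f"{product.name} {product.description}".lower()
    best = 0.0
    for clause in query.lower().split(" or "):
        terms = [t.strip() for t in clause.split(" and ") if t.strip()]
        if terms:
            hits = sum(1 for t in terms if t in text)
            best = max(best, hits / len(terms))
    return best

def retrieve(query: str, products: list[Product]) -> list[Product]:
    """Return products with a non-zero score, best matches first."""
    ranked = sorted(products, key=lambda p: score(query, p), reverse=True)
    return [p for p in ranked if score(query, p) > 0]
```

For example, a query such as `"laptop and 16gb"` fully matches a 16 GB laptop, while a 32 GB laptop still scores 0.5 and is ranked below it rather than discarded, which mirrors the partial-match ranking behavior the abstract attributes to the system.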
Arithmetic N-gram: an efficient data compression technique
IF 2.5 CAS Zone 3, Computer Science Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2024-03-13 DOI: 10.1007/s10791-024-09431-y
Ali Hassan, Sadaf Javed, Sajjad Hussain, Rizwan Ahmad, Shams Qazi

Due to the rapid growth of data in this digital era and the limited resources available, more efficient data compression techniques are needed for storing and transmitting data. Data compression can significantly reduce the storage space and transmission time required for a given dataset. Text compression in particular has received more attention for effectively managing and processing data, owing to the increased use of the internet, digital devices, and data transfer. Over the years, various algorithms have been used for text compression, such as Huffman coding, Lempel-Ziv-Welch (LZW) coding, and arithmetic coding. However, these methods have a limited compression ratio, particularly for data storage applications where a considerable amount of data must be compressed to use storage resources efficiently, because they compress data character by character. It can be more advantageous to consider words or sequences of words rather than individual characters to obtain a better compression ratio: individual characters exhibit less repetition and structure in the data, so compressing them yields a sizeable compressed representation. In this paper, we propose the ArthNgram model, in which an N-gram language model coupled with arithmetic coding is used to compress data more efficiently for data storage applications. The performance of the proposed model is evaluated based on compression ratio and compression speed. Results show that the proposed model performs better than traditional techniques.

{"title":"Arithmetic N-gram: an efficient data compression technique","authors":"Ali Hassan, Sadaf Javed, Sajjad Hussain, Rizwan Ahmad, Shams Qazi","doi":"10.1007/s10791-024-09431-y","DOIUrl":"https://doi.org/10.1007/s10791-024-09431-y","url":null,"abstract":"<p>Due to the increase in the growth of data in this era of the digital world and limited resources, there is a need for more efficient data compression techniques for storing and transmitting data. Data compression can significantly reduce the amount of storage space and transmission time to store and transmit given data. More specifically, text compression has got more attention for effectively managing and processing data due to the increased use of the internet, digital devices, data transfer, etc. Over the years, various algorithms have been used for text compression such as Huffman coding, Lempel-Ziv-Welch (LZW) coding, arithmetic coding, etc. However, these methods have a limited compression ratio specifically for data storage applications where a considerable amount of data must be compressed to use storage resources efficiently. They consider individual characters to compress data. It can be more advantageous to consider words or sequences of words rather than individual characters to get a better compression ratio. Compressing individual characters results in a sizeable compressed representation due to their less repetition and structure in the data. In this paper, we proposed the ArthNgram model, in which the N-gram language model coupled with arithmetic coding is used to compress data more efficiently for data storage applications. The performance of the proposed model is evaluated based on compression ratio and compression speed. 
Results show that the proposed model performs better than traditional techniques.</p>","PeriodicalId":54352,"journal":{"name":"Information Retrieval Journal","volume":"126 1","pages":""},"PeriodicalIF":2.5,"publicationDate":"2024-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140129892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
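The arithmetic-coding half of the idea can be sketched as follows, using exact `Fraction` arithmetic and, for brevity, a unigram word model (ArthNgram conditions on longer N-gram contexts; the function names here are illustrative). The encoder repeatedly narrows the interval [0, 1) by each symbol's probability slice, and any number inside the final interval identifies the whole message.

```python
from fractions import Fraction

def cum_table(freq: dict) -> dict:
    """Map each symbol to its cumulative probability interval [lo, hi)."""
    total = sum(freq.values())
    table, lo = {}, Fraction(0)
    for sym, f in sorted(freq.items()):
        hi = lo + Fraction(f, total)
        table[sym] = (lo, hi)
        lo = hi
    return table

def encode(symbols: list, table: dict) -> Fraction:
    """Narrow [lo, hi) by each symbol's slice; return a point inside."""
    lo, hi = Fraction(0), Fraction(1)
    for s in symbols:
        s_lo, s_hi = table[s]
        width = hi - lo
        lo, hi = lo + width * s_lo, lo + width * s_hi
    return (lo + hi) / 2  # any value in [lo, hi) decodes to the message

def decode(code: Fraction, table: dict, n: int) -> list:
    """Invert encode(): find the slice containing the code, rescale, repeat."""
    out = []
    for _ in range(n):
        for sym, (s_lo, s_hi) in table.items():
            if s_lo <= code < s_hi:
                out.append(sym)
                code = (code - s_lo) / (s_hi - s_lo)
                break
    return out
```

Working over words rather than characters, as the abstract argues, gives the model larger, more repetitive units to assign probability mass to; a practical coder would also replace the exact fractions with fixed-precision integer arithmetic for speed.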
Tashaphyne0.4: a new arabic light stemmer based on rhyzome modeling approach
IF 2.5 CAS Zone 3, Computer Science Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2023-12-14 DOI: 10.1007/s10791-023-09429-y
Ra’ed M. Al-Khatib, Taha Zerrouki, Mohammed M. Abu Shquier, Amar Balla

Stemming algorithms are crucial tools for enhancing the information retrieval process in natural language processing. This paper presents a novel Arabic light stemming algorithm called Tashaphyne0.4; the idea behind this algorithm is to extract the most precise ‘roots’ and ‘stems’ from the words of an Arabic text. Thus, the proposed algorithm acts as a rooter, a stemmer, and a segmentation tool at the same time. Our approach involves three phases (Preparation, Stems-Extractor, and Root-Extractor). Tashaphyne0.4 has shown better results than six other stemmers (Khoja, ISRI, Motaz/Light10, Tashaphyne0.3, FARASA, and Assem). The comparison is performed on four different comprehensive Arabic benchmark datasets. In conclusion, our proposed stemmer achieves remarkable results and outperforms the other competitive stemmers in extracting ‘Roots’ and ‘Stems’.

{"title":"Tashaphyne0.4: a new arabic light stemmer based on rhyzome modeling approach","authors":"Ra’ed M. Al-Khatib, Taha Zerrouki, Mohammed M. Abu Shquier, Amar Balla","doi":"10.1007/s10791-023-09429-y","DOIUrl":"https://doi.org/10.1007/s10791-023-09429-y","url":null,"abstract":"<p>Stemming algorithms are crucial tools for enhancing the information retrieval process in natural language processing. This paper presents a novel Arabic light stemming algorithm called Tashaphyne0.4, the idea behind this algorithm is to extract the most precise ‘<i>roots</i>’, and ‘<i>stems</i>’ from words of an Arabic text. Thus, the proposed algorithm acts as rooter, stemmer, and segmentation tools at the same time. Our approach involves tri-fold phases (i.e., Preparation, Stems-Extractor, and Root-Extractor). Tashaphyne0.4 has shown better results than six other stemmers (i.e., Khoja, ISRI, Motaz/Light10, Tashaphyne0.3, FARASA, and Assem stemmers). The comparison is performed using four different Arabic comprehensive-benchmarks datasets. In conclusion, our proposed stemmer achieved remarkable results and outperformed other competitive stemmers in extracting ‘<i>Roots</i>’ and ‘<i>Stems</i>’.</p>","PeriodicalId":54352,"journal":{"name":"Information Retrieval Journal","volume":"18 1","pages":""},"PeriodicalIF":2.5,"publicationDate":"2023-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138680535","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
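A light stemmer of the general flavor described above strips affixes from both ends of a word rather than deriving a morphological root. The sketch below is an illustration only, not the actual Tashaphyne0.4 algorithm (which uses customizable affix lists and a rhyzome-modeling phase); the `PREFIXES` and `SUFFIXES` lists and the `light_stem` name are assumptions made for the example.

```python
# Illustrative Arabic affix lists; real stemmers use far larger,
# carefully curated sets.
PREFIXES = ["ال", "وال", "بال", "كال", "فال", "لل", "و"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "ه", "ة", "ي"]

def light_stem(word: str, min_stem: int = 2) -> str:
    """Strip one longest matching prefix and one longest matching suffix,
    never reducing the word below `min_stem` characters."""
    for p in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(p) and len(word) - len(p) >= min_stem:
            word = word[len(p):]
            break
    for s in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(s) and len(word) - len(s) >= min_stem:
            word = word[:-len(s)]
            break
    return word
```

For instance, stripping the definite article from الكتاب ("the book") yields the stem كتاب ("book"), while a word already shorter than the minimum stem length passes through unchanged.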