Short-term POI recommendation with personalized time-weighted latent ranking
Yufeng Zou, Kaiqi Zhao
Pub Date: 2024-07-03 | DOI: 10.1007/s10791-024-09450-9
In this paper, we formulate a novel Point-of-interest (POI) recommendation task, named short-term POI recommendation: recommending a set of new POIs to visit within a short period following recent check-ins. It differs from previously studied tasks and poses new challenges, such as modeling high-order POI transitions within a short period. We present PTWLR, a personalized time-weighted latent ranking model that jointly learns short-term POI transitions and user preferences, with a proposed temporal weighting scheme to capture the temporal context of transitions. We extend the model to accommodate transition dependencies on multiple recent check-ins. In experiments on real-world datasets, our model consistently outperforms seven widely used methods by significant margins in various contexts, demonstrating its effectiveness on our task. Further analysis shows that all proposed components contribute to the performance improvement.
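The abstract leaves the temporal weighting scheme unspecified; as a hedged illustration of the general idea (not PTWLR's actual formulation), one can weight each observed POI transition by an exponential decay over the check-in time gap, so that transitions made in quick succession count more. The data layout and the decay constant `tau` are assumptions:

```python
import math
from collections import defaultdict

def transition_weights(checkins, tau=3600.0):
    """Weight consecutive POI transitions by the recency of their time gap.

    checkins: list of (poi_id, unix_timestamp) pairs sorted by time.
    tau: decay constant in seconds (a hypothetical choice).
    Returns a dict mapping (poi_from, poi_to) -> accumulated weight.
    """
    weights = defaultdict(float)
    for (p_prev, t_prev), (p_next, t_next) in zip(checkins, checkins[1:]):
        gap = t_next - t_prev
        # exponential decay: short gaps contribute more to the transition
        weights[(p_prev, p_next)] += math.exp(-gap / tau)
    return dict(weights)
```

With `tau` one hour, a transition made one hour after the previous check-in contributes a weight of e^-1, while a near-immediate transition contributes close to 1.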
IDaTPA: importance degree based thread partitioning approach in thread level speculation
Li Yuxiang, Zhang Zhiyong, Wang Xinyong, Huang Shuaina, Su Yaning
Pub Date: 2024-06-19 | DOI: 10.1007/s10791-024-09440-x
As a thread-level auto-parallelization technique for multi-core processors, Thread-Level Speculation (TLS), also called Speculative Multithreading (SpMT), partitions programs into multiple threads and executes them speculatively under ambiguous data and control dependences. The thread partitioning approach plays a key role in TLS performance. The existing heuristic-rules-based approach (HR-based approach), a one-size-fits-all strategy, cannot guarantee satisfactory thread partitioning. This paper proposes an importance degree based thread partitioning approach (IDaTPA) to partition irregular programs into multiple threads. IDaTPA biases the partitioning for each procedure using a machine learning method. It mainly comprises: constructing the sample set, knowledge representation, similarity calculation, and a prediction model; the prediction model then performs the partitioning of irregular programs. Using IDaTPA, subprocedures in unseen irregular programs obtain satisfactory partitions. Evaluated on a generic SpMT processor (called Prophet) for multithreaded programs, IDaTPA delivers an average speedup of 1.80 on a 4-core processor. Furthermore, to evaluate its portability, we port IDaTPA to an 8-core processor and obtain an average speedup of 2.82. Experimental results show that IDaTPA obtains a significant speedup increase over the conventional HR-based approach: the Olden benchmarks deliver a 5.75% performance improvement on 4 cores and a 6.32% improvement on 8 cores, and the SPEC2020 benchmarks obtain a 38.20% performance improvement.
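The abstract names similarity calculation and a prediction model as components; a minimal sketch of how a similarity-based predictor could pick partitioning parameters for an unseen procedure from labeled samples (the feature vectors and nearest-neighbour lookup are assumptions for illustration, not IDaTPA's actual learning method):

```python
def cosine(u, v):
    """Cosine similarity between two equal-length numeric feature vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
    return num / den if den else 0.0

def predict_partition(features, samples):
    """Return the partition parameters of the most similar training sample.

    features: feature vector of the unseen procedure.
    samples: list of (feature_vector, partition_params) training pairs.
    """
    return max(samples, key=lambda s: cosine(features, s[0]))[1]
```

A procedure whose features resemble a known sample inherits that sample's partitioning decision, which is the intuition behind per-procedure biased partitioning.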
A Geth-based real-time detection system for sandwich attacks in Ethereum
Dongze Li, Kejia Zhang, Lei Wang, Gang Du
Pub Date: 2024-05-30 | DOI: 10.1007/s10791-024-09445-6
With the rapid development of the Ethereum ecosystem and the increasing applications of decentralized finance (DeFi), security research on smart contracts and blockchain transactions has attracted increasing attention. In particular, front-running attacks on the Ethereum platform have become a major security concern. These attack strategies exploit the transparency and determinism of the blockchain, enabling attackers to gain unfair economic benefits by manipulating the transaction order. This study proposes a sandwich attack detection system integrated into the go-Ethereum client (Geth). By analyzing transaction data streams, the system effectively detects and defends against front-running and sandwich attacks. It analyzes transactions within blocks in real time, quickly identifying abnormal patterns and potential attack behaviors. The system has been optimized for performance, with an average processing time of 0.442 s per block and an accuracy rate of 83%. Response time for real-time detection of new blocks is within 5 s, with the majority between 1 and 2 s, which is considered acceptable. The findings indicate that, as part of the go-Ethereum client, this detection system helps enhance the security of the Ethereum blockchain, contributing to the protection of DeFi users' private funds and the safety of smart contracts. The primary contribution of this study is an efficient blockchain transaction monitoring system capable of accurately detecting sandwich attack transactions within blocks while maintaining normal operating speed as a full node.
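A sandwich attack brackets a victim's trade with the attacker's own buy and sell in the same block. As a hedged sketch of the ordering pattern being detected (the simplified transaction fields and the plain triple scan are assumptions for illustration; the paper's Geth-integrated detector works on real decoded transaction streams):

```python
def find_sandwiches(block_txs):
    """Flag candidate sandwich patterns in one block's ordered transactions.

    block_txs: list of dicts with keys 'sender', 'token', 'side' ('buy'/'sell'),
    in block order. Returns (front, victim, back) index triples where one
    sender buys a token before another sender's buy and sells it afterwards.
    """
    hits = []
    n = len(block_txs)
    for i in range(n):
        f = block_txs[i]
        if f["side"] != "buy":
            continue
        for j in range(i + 1, n):
            v = block_txs[j]
            if v["token"] != f["token"] or v["side"] != "buy" or v["sender"] == f["sender"]:
                continue
            for k in range(j + 1, n):
                b = block_txs[k]
                if b["sender"] == f["sender"] and b["token"] == f["token"] and b["side"] == "sell":
                    hits.append((i, j, k))
                    break
    return hits
```

A production detector would additionally check amounts, gas prices, and the DEX pool involved to cut false positives; this sketch only captures the buy-buy-sell ordering signature.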
Advances in information retrieval collection on the European conference on information retrieval 2023
Jaap Kamps, Lorraine Goeuriot, Fabio Crestani
Pub Date: 2024-05-23 | DOI: 10.1007/s10791-024-09442-9
This paper introduces the Collection on ECIR 2023. The 45th European Conference on Information Retrieval (ECIR 2023) was held in Dublin, Ireland, during April 2–6, 2023. The conference was the largest ECIR ever and brought together hundreds of researchers from Europe and abroad. A selection of papers shortlisted for the best paper awards was invited to submit expanded versions, appearing in this Discover Computing (formerly the Information Retrieval Journal) Collection on ECIR 2023. First, an analytic paper on incorporating first-stage retrieval status values as input to neural cross-encoder re-rankers. Second, new models and new data for the new task of temporal natural language inference. Third, a weak supervision approach to video retrieval that overcomes the need for large-scale human-labeled training data. Together, these papers showcase the breadth and diversity of current research on information retrieval.
Temporal validity reassessment: commonsense reasoning about information obsoleteness
Taishi Hosokawa, Adam Jatowt, Kazunari Sugiyama
Pub Date: 2024-05-06 | DOI: 10.1007/s10791-024-09433-w
It is useful for machines to know whether text information remains valid, for applications including text comprehension, story understanding, temporal information retrieval, and user state tracking on microblogs as well as in chatbot conversations. This kind of inference remains difficult for current models, including large language models, as it requires temporal commonsense knowledge and reasoning. In this paper we approach the task of Temporal Validity Reassessment, inspired by traditional natural language reasoning, to determine updates to the temporal validity of text content. The task requires judging whether actions expressed in a sentence are still ongoing or already completed, and hence whether the sentence remains valid or has become obsolete, given context in the form of supplementary content such as a follow-up sentence. We first construct our own dataset for this task and train several machine learning models. We then propose an effective method for learning information from an external knowledge base that provides temporal commonsense knowledge. Using the prepared dataset, we introduce a machine learning model that incorporates the information from the knowledge base and demonstrate that incorporating external knowledge generally improves the results. We also experiment with different embedding types to represent temporal commonsense knowledge, as well as with data augmentation methods to increase the size of our dataset.
Bcubed revisited: elements like me
Ruben van Heusden, Jaap Kamps, Maarten Marx
Pub Date: 2024-05-06 | DOI: 10.1007/s10791-024-09436-7
BCubed is a mathematically clean, elegant and intuitively well-behaved external performance metric for clustering tasks. BCubed compares a predicted clustering to a known ground-truth clustering through element-wise precision and recall scores. For each element, the predicted and ground-truth clusters containing the element are compared, and the mean over all elements is taken. We argue that BCubed overestimates performance, for the intuitive reason that a clustering gets credit for putting an element into its own cluster. We repair this and investigate the repaired version, called "Elements Like Me" (ELM). We extensively evaluate ELM from both a theoretical and an empirical perspective, and conclude that it retains all of BCubed's positive properties and yields a minimum zero score when it should. Synthetic experiments show that ELM can rank predicted clusterings differently from BCubed, and that ELM scores are distributed with a lower mean and a larger variance than BCubed scores.
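From the description, BCubed scores each element by comparing its predicted cluster with its gold cluster, and the ELM repair stops crediting an element for co-occurring with itself. A sketch of both variants follows; the exact ELM definition and its handling of singleton clusters are assumptions inferred from the abstract, not the paper's formulas:

```python
def bcubed_elm(pred, gold, exclude_self=True):
    """Element-wise clustering precision/recall, BCubed-style.

    pred, gold: dicts mapping element -> cluster id.
    exclude_self=True applies the assumed ELM repair: the element itself
    does not count as its own correct neighbour, so the minimum score is 0.
    Returns (mean precision, mean recall) over all elements.
    """
    elems = list(pred)
    P, R = [], []
    for e in elems:
        same_pred = [x for x in elems if pred[x] == pred[e]]
        same_gold = [x for x in elems if gold[x] == gold[e]]
        both = [x for x in elems if pred[x] == pred[e] and gold[x] == gold[e]]
        if exclude_self:
            num, p_den, r_den = len(both) - 1, len(same_pred) - 1, len(same_gold) - 1
            # singleton clusters leave a zero denominator; scoring them 1.0
            # is a convention assumed here, not taken from the paper
            P.append(num / p_den if p_den else 1.0)
            R.append(num / r_den if r_den else 1.0)
        else:
            P.append(len(both) / len(same_pred))
            R.append(len(both) / len(same_gold))
    return sum(P) / len(P), sum(R) / len(R)
```

With `exclude_self=False` this is plain BCubed, whose per-element scores can never reach 0 because every element matches itself, which is exactly the overestimation the paper targets.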
A model of the relationship between the variations of effectiveness and fairness in information retrieval
Massimo Melucci
Pub Date: 2024-04-23 | DOI: 10.1007/s10791-024-09434-9
The requirement that, for fair document retrieval, documents should be ranked so as to equally expose authors and organizations has been studied for some years. The fair exposure of a ranking, however, undermines the optimality of the Probability Ranking Principle and, as a consequence, retrieval effectiveness. It is shown how the variations of fairness and effectiveness can be related by a model. To this end, the paper introduces a fairness measure inspired by Gini's index of mutability for non-ordinal variables and relates it to a sufficiently general measure of effectiveness, thus modeling the connection between these two dimensions of Information Retrieval. The paper also introduces a measurement of the statistical significance of the fairness measure. An empirical study completes the paper.
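As a hedged illustration of a Gini-style exposure-fairness measure over a ranking (the logarithmic rank discount and the aggregation per author are assumptions; the paper's measure, derived from Gini's index of mutability for non-ordinal variables, differs in detail):

```python
import math
from collections import defaultdict

def exposure_gini(ranking):
    """Gini coefficient of per-author exposure in a ranked list.

    ranking: list of author ids in rank order. Exposure of rank r (1-based)
    uses a logarithmic discount, an assumption borrowed from DCG-style
    position models. Returns 0.0 for perfectly equal exposure, up to 1.0.
    """
    exp = defaultdict(float)
    for r, author in enumerate(ranking, start=1):
        exp[author] += 1.0 / math.log2(r + 1)
    vals = sorted(exp.values())
    n, total = len(vals), sum(exp.values())
    # standard Gini formula over the ascending-sorted exposure values
    cum = sum((2 * i - n - 1) * v for i, v in enumerate(vals, start=1))
    return cum / (n * total) if total else 0.0
```

Top ranks carry most of the exposure, so a ranking that concentrates one author's documents at the top scores a higher (less fair) Gini value than one that interleaves authors.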
Boolean interpretation, matching, and ranking of natural language queries in product selection systems
Matthew Moulton, Yiu-Kai Ng
Pub Date: 2024-04-03 | DOI: 10.1007/s10791-024-09432-x
E-commerce is a massive sector in the US economy, generating $767.7 billion in revenue in 2021. E-commerce sites maximize their revenue by helping customers find, examine, and purchase products. To help users easily find the products most relevant to their individual needs, e-commerce sites are equipped with a product retrieval system. Many modern retrieval systems parse user-specified constraints or keywords embedded in a simple natural language query, which is generally easier and faster for customers than navigating a product specification form, and does not require the seller to design or develop such a form. These natural language product retrieval systems, however, suffer from low relevance in retrieved products, especially for complex constraints specified on products. The reduced accuracy is partly due to under-utilizing the rich semantics of natural language, specifically queries that include Boolean operators, and to the lack of ranking for partially-matched relevant results that could interest customers. This undesirable effect causes e-commerce vendors to lose sales on their merchandise. To solve this problem, we propose a novel product retrieval system, called QuePR, that parses arbitrarily simple and complex natural language queries with or without Boolean operators, utilizes combinatorial numeric and content-based matching to extract relevant products from a database, and ranks the retrieved products by relevance before presenting them to the end-user. The advantages of QuePR are its ability to process explicit and implicit Boolean operators in queries, handle natural language queries using similarity measures on partially-matched records, and perform a best guess or match on ambiguous or incomplete queries. QuePR is unique, easy to use, and scalable to all product categories. To verify the accuracy of QuePR in retrieving relevant products across different product domains, we conducted several performance analyses and compared QuePR with other ranking and retrieval systems. The empirical results verify that QuePR outperforms the others while maintaining an optimal runtime speed.
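As a hedged sketch of the Boolean-interpretation half of such a system, here is a recursive evaluator for already-parsed queries with explicit AND/OR/NOT (the tuple query representation is an assumption for illustration; QuePR additionally infers implicit operators from natural language and scores partial matches rather than returning a strict yes/no):

```python
def eval_query(q, doc_words):
    """Evaluate a parsed Boolean query against a product description.

    q: either a term string or a tuple (op, *args) with op in {'AND', 'OR', 'NOT'}.
    doc_words: set of words in the product description.
    Returns True iff the description satisfies the query.
    """
    if isinstance(q, str):
        return q in doc_words
    op, *args = q
    if op == "AND":
        return all(eval_query(a, doc_words) for a in args)
    if op == "OR":
        return any(eval_query(a, doc_words) for a in args)
    if op == "NOT":
        return not eval_query(args[0], doc_words)
    raise ValueError(f"unknown operator: {op}")
```

A query like "red shoes or boots" would parse to `("AND", "red", ("OR", "shoe", "boot"))` under this representation; replacing the strict membership test with a similarity score would give the partial-match ranking the abstract describes.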
Pub Date : 2024-03-13DOI: 10.1007/s10791-024-09431-y
Ali Hassan, Sadaf Javed, Sajjad Hussain, Rizwan Ahmad, Shams Qazi
Due to the increase in the growth of data in this era of the digital world and limited resources, there is a need for more efficient data compression techniques for storing and transmitting data. Data compression can significantly reduce the amount of storage space and transmission time to store and transmit given data. More specifically, text compression has got more attention for effectively managing and processing data due to the increased use of the internet, digital devices, data transfer, etc. Over the years, various algorithms have been used for text compression such as Huffman coding, Lempel-Ziv-Welch (LZW) coding, arithmetic coding, etc. However, these methods have a limited compression ratio specifically for data storage applications where a considerable amount of data must be compressed to use storage resources efficiently. They consider individual characters to compress data. It can be more advantageous to consider words or sequences of words rather than individual characters to get a better compression ratio. Compressing individual characters results in a sizeable compressed representation due to their less repetition and structure in the data. In this paper, we proposed the ArthNgram model, in which the N-gram language model coupled with arithmetic coding is used to compress data more efficiently for data storage applications. The performance of the proposed model is evaluated based on compression ratio and compression speed. Results show that the proposed model performs better than traditional techniques.
Title: Arithmetic N-gram: an efficient data compression technique (Information Retrieval Journal, vol. 126(1), published 2024-03-13).
Pub Date: 2023-12-14. DOI: 10.1007/s10791-023-09429-y
Ra’ed M. Al-Khatib, Taha Zerrouki, Mohammed M. Abu Shquier, Amar Balla
Stemming algorithms are crucial tools for enhancing the information retrieval process in natural language processing. This paper presents a novel Arabic light stemming algorithm called Tashaphyne0.4. The idea behind this algorithm is to extract the most precise ‘roots’ and ‘stems’ from the words of an Arabic text; thus, the proposed algorithm acts as a rooter, a stemmer, and a segmenter at the same time. Our approach involves three phases (Preparation, Stems-Extractor, and Root-Extractor). Tashaphyne0.4 has shown better results than six other stemmers (Khoja, ISRI, Motaz/Light10, Tashaphyne0.3, FARASA, and the Assem stemmer). The comparison is performed on four comprehensive Arabic benchmark datasets. In conclusion, our proposed stemmer achieved remarkable results and outperformed the competing stemmers in extracting ‘Roots’ and ‘Stems’.
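The abstract does not detail the rhyzome-based algorithm, so the following is a generic light-stemming sketch for flavor only: strip one prefix and one suffix from fixed affix lists, guarding a minimum stem length so short words are not over-stripped. This is not Tashaphyne0.4 itself, and the affix lists are illustrative, not the paper's.

```python
# Generic Arabic light-stemming sketch (NOT the Tashaphyne0.4 algorithm):
# remove the longest matching prefix, then the longest matching suffix,
# only when enough of the word remains to form a plausible stem.
PREFIXES = ["وال", "بال", "كال", "فال", "لل", "ال", "و", "ف", "ب", "ك", "ل"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "ه", "ة", "ي"]

def light_stem(word: str, min_len: int = 3) -> str:
    """Strip one prefix and one suffix, keeping at least min_len letters."""
    stem = word
    for p in PREFIXES:                 # longer prefixes are listed first
        if stem.startswith(p) and len(stem) - len(p) >= min_len:
            stem = stem[len(p):]
            break
    for s in SUFFIXES:
        if stem.endswith(s) and len(stem) - len(s) >= min_len:
            stem = stem[:-len(s)]
            break
    return stem
```

For instance, this strips the conjunction-plus-article prefix from والكتاب ("and the book") to leave the stem كتاب; the open-source Tashaphyne Python package provides a far richer `ArabicLightStemmer` with root extraction and segmentation, which this sketch only gestures at.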
Title: Tashaphyne0.4: a new arabic light stemmer based on rhyzome modeling approach (Information Retrieval Journal, vol. 18(1), published 2023-12-14).