{"title":"随机分层子采样对协同滤波性能和效率的最优依赖性","authors":"Samin Poudel;Marwan Bikdash","doi":"10.26599/BDMA.2021.9020032","DOIUrl":null,"url":null,"abstract":"Dropping fractions of users or items judiciously can reduce the computational cost of Collaborative Filtering (CF) algorithms. The effect of this subsampling on the computing time and accuracy of CF is not fully understood, and clear guidelines for selecting optimal or even appropriate subsampling levels are not available. In this paper, we present a Density-based Random Stratified Subsampling using Clustering (DRSC) algorithm in which the desired Fraction of Users Dropped (FUD) and Fraction of Items Dropped (FID) are specified, and the overall density during subsampling is maintained. Subsequently, we develop simple models of the Training Time Improvement (TTI) and the Accuracy Loss (AL) as functions of FUD and FID, based on extensive simulations of seven standard CF algorithms as applied to various primary matrices from MovieLens, Yahoo Music Rating, and Amazon Automotive data. Simulations show that both TTI and a scaled AL are bi-linear in FID and FUD for all seven methods. The TTI linear regression of a CF method appears to be same for all datasets. Extensive simulations illustrate that TTI can be estimated reliably with FUD and FID only, but AL requires considering additional dataset characteristics. The derived models are then used to optimize the levels of subsampling addressing the tradeoff between TTI and AL. A simple sub-optimal approximation was found, in which the optimal AL is proportional to the optimal Training Time Reduction Factor (TTRF) for higher values of TTRF, and the optimal subsampling levels, like optimal FID/(1–FID), are proportional to the square root of TTRF.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"5 3","pages":"192-205"},"PeriodicalIF":7.7000,"publicationDate":"2022-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/9793354/09793360.pdf","citationCount":"9","resultStr":"{\"title\":\"Optimal Dependence of Performance and Efficiency of Collaborative Filtering on Random Stratified Subsampling\",\"authors\":\"Samin Poudel;Marwan Bikdash\",\"doi\":\"10.26599/BDMA.2021.9020032\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Dropping fractions of users or items judiciously can reduce the computational cost of Collaborative Filtering (CF) algorithms. The effect of this subsampling on the computing time and accuracy of CF is not fully understood, and clear guidelines for selecting optimal or even appropriate subsampling levels are not available. In this paper, we present a Density-based Random Stratified Subsampling using Clustering (DRSC) algorithm in which the desired Fraction of Users Dropped (FUD) and Fraction of Items Dropped (FID) are specified, and the overall density during subsampling is maintained. Subsequently, we develop simple models of the Training Time Improvement (TTI) and the Accuracy Loss (AL) as functions of FUD and FID, based on extensive simulations of seven standard CF algorithms as applied to various primary matrices from MovieLens, Yahoo Music Rating, and Amazon Automotive data. Simulations show that both TTI and a scaled AL are bi-linear in FID and FUD for all seven methods. The TTI linear regression of a CF method appears to be same for all datasets. 
Extensive simulations illustrate that TTI can be estimated reliably with FUD and FID only, but AL requires considering additional dataset characteristics. The derived models are then used to optimize the levels of subsampling addressing the tradeoff between TTI and AL. A simple sub-optimal approximation was found, in which the optimal AL is proportional to the optimal Training Time Reduction Factor (TTRF) for higher values of TTRF, and the optimal subsampling levels, like optimal FID/(1–FID), are proportional to the square root of TTRF.\",\"PeriodicalId\":52355,\"journal\":{\"name\":\"Big Data Mining and Analytics\",\"volume\":\"5 3\",\"pages\":\"192-205\"},\"PeriodicalIF\":7.7000,\"publicationDate\":\"2022-06-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://ieeexplore.ieee.org/iel7/8254253/9793354/09793360.pdf\",\"citationCount\":\"9\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Big Data Mining and Analytics\",\"FirstCategoryId\":\"1093\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/9793360/\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Big Data Mining and Analytics","FirstCategoryId":"1093","ListUrlMain":"https://ieeexplore.ieee.org/document/9793360/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 9
Abstract
Dropping fractions of users or items judiciously can reduce the computational cost of Collaborative Filtering (CF) algorithms. The effect of this subsampling on the computing time and accuracy of CF is not fully understood, and clear guidelines for selecting optimal or even appropriate subsampling levels are not available. In this paper, we present a Density-based Random Stratified Subsampling using Clustering (DRSC) algorithm in which the desired Fraction of Users Dropped (FUD) and Fraction of Items Dropped (FID) are specified, and the overall density is maintained during subsampling. Subsequently, we develop simple models of the Training Time Improvement (TTI) and the Accuracy Loss (AL) as functions of FUD and FID, based on extensive simulations of seven standard CF algorithms applied to various primary matrices from the MovieLens, Yahoo Music Rating, and Amazon Automotive datasets. Simulations show that both TTI and a scaled AL are bilinear in FID and FUD for all seven methods. The TTI linear regression of a CF method appears to be the same for all datasets. Extensive simulations also illustrate that TTI can be estimated reliably from FUD and FID alone, whereas AL requires considering additional dataset characteristics. The derived models are then used to optimize the subsampling levels, addressing the tradeoff between TTI and AL. We find a simple sub-optimal approximation in which, for higher values of the Training Time Reduction Factor (TTRF), the optimal AL is proportional to the optimal TTRF, and the optimal subsampling levels, such as the optimal FID/(1 - FID), are proportional to the square root of TTRF.
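As one reading of the subsampling step, the sketch below drops a specified fraction of users (FUD) and items (FID) from a rating matrix while sampling within density strata so that the overall rating density is roughly preserved. The abstract does not spell out the DRSC algorithm, so the quantile-based strata and the function `stratified_drop` are illustrative assumptions, not the authors' exact clustering-based method.

```python
# Minimal sketch of density-based stratified subsampling, assuming a
# dense NumPy rating matrix with zeros for missing ratings. Quantile
# bins over rating counts stand in for the paper's clustering step.
import numpy as np

def stratified_drop(R, frac_drop, axis=0, n_strata=4, seed=None):
    """Drop `frac_drop` of rows (axis=0, users) or columns (axis=1,
    items), sampling uniformly within density strata so the overall
    rating density is roughly preserved. Returns kept indices."""
    rng = np.random.default_rng(seed)
    counts = (R != 0).sum(axis=1 - axis)            # ratings per user/item
    edges = np.quantile(counts, np.linspace(0, 1, n_strata + 1))
    strata = np.clip(np.searchsorted(edges, counts, side="right") - 1,
                     0, n_strata - 1)
    keep = []
    for s in range(n_strata):
        idx = np.flatnonzero(strata == s)
        if idx.size == 0:                           # empty stratum: skip
            continue
        n_keep = round(idx.size * (1.0 - frac_drop))
        keep.extend(rng.choice(idx, size=n_keep, replace=False))
    return np.sort(np.asarray(keep, dtype=int))

# Apply the desired FUD and FID to a toy 200 x 100 rating matrix.
rng = np.random.default_rng(0)
mask = rng.random((200, 100)) < 0.05
R = mask * rng.integers(1, 6, size=(200, 100))
users = stratified_drop(R, frac_drop=0.30, axis=0, seed=1)          # FUD = 0.30
items = stratified_drop(R[users], frac_drop=0.20, axis=1, seed=2)   # FID = 0.20
R_sub = R[np.ix_(users, items)]
print("density before:", (R != 0).mean().round(4),
      "after:", (R_sub != 0).mean().round(4))
```

On real data the strata would come from the clustering step the paper describes; the quantile bins here simply approximate that grouping.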
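To make the modeling step concrete, the following sketch fits a bilinear surface, TTI = b0 + b1*FUD + b2*FID + b3*FUD*FID, by least squares, and then applies the abstract's closing rule of thumb, FID/(1 - FID) proportional to the square root of TTRF, to suggest a subsampling level. All coefficients, the noise level, and the constant k are made-up placeholders rather than values reported in the paper.

```python
import numpy as np

# Synthetic (FUD, FID, TTI) observations following a bilinear surface
# plus noise; the coefficients are placeholders, not fitted results
# from the paper's simulations.
rng = np.random.default_rng(42)
fud = rng.uniform(0.0, 0.6, 50)
fid = rng.uniform(0.0, 0.6, 50)
tti = 0.9 * fud + 1.4 * fid + 0.3 * fud * fid + rng.normal(0, 0.02, 50)

# Ordinary least squares for TTI = b0 + b1*FUD + b2*FID + b3*FUD*FID.
X = np.column_stack([np.ones_like(fud), fud, fid, fud * fid])
b = np.linalg.lstsq(X, tti, rcond=None)[0]
print("TTI ~ %.3f + %.3f*FUD + %.3f*FID + %.3f*FUD*FID" % tuple(b))

# Rule of thumb from the abstract: optimal FID/(1 - FID) scales with
# sqrt(TTRF). The constant k is a hypothetical proportionality factor.
def fid_for_ttrf(ttrf, k=0.25):
    r = k * np.sqrt(ttrf)      # r = FID / (1 - FID)
    return r / (1.0 + r)       # solve for FID

for ttrf in (2, 4, 8):
    print("TTRF=%d -> suggested FID ~ %.2f" % (ttrf, fid_for_ttrf(ttrf)))
```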
Journal Introduction
Big Data Mining and Analytics, published by Tsinghua University Press, presents groundbreaking research in big data and its applications. The journal covers the exploration and analysis of vast amounts of data from diverse sources to uncover hidden patterns, correlations, insights, and knowledge.
Featuring the latest developments, research issues, and solutions, the journal offers valuable insights into the world of big data, along with a deep understanding of data mining techniques, data analytics, and their practical applications.
Big Data Mining and Analytics has gained significant recognition and is indexed and abstracted in esteemed platforms such as ESCI, EI, Scopus, DBLP Computer Science, Google Scholar, INSPEC, CSCD, DOAJ, CNKI, and more.
With its wealth of information and its ability to transform the way we perceive and utilize data, the journal is essential reading for researchers, professionals, and anyone interested in the field of big data analytics.