基于归纳法的增量数据集高效高效用项集挖掘方法

IF 4.6 2区计算机科学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Data Science and Engineering Pub Date : 2023-09-29 DOI:10.1007/s41019-023-00229-4

Pushp Sra, Satish Chand

{"title":"基于归纳法的增量数据集高效高效用项集挖掘方法","authors":"Pushp Sra, Satish Chand","doi":"10.1007/s41019-023-00229-4","DOIUrl":null,"url":null,"abstract":"Abstract High utility itemset mining is a crucial research area that focuses on identifying combinations of itemsets from databases that possess a utility value higher than a user-specified threshold. However, most existing algorithms assume that the databases are static, which is not realistic for real-life datasets that are continuously growing with new data. Furthermore, existing algorithms only rely on the utility value to identify relevant itemsets, leading to even the earliest occurring combinations being produced as output. Although some mining algorithms adopt a support-based approach to account for itemset frequency, they do not consider the temporal nature of itemsets. To address these challenges, this paper proposes the Scented Utility Miner (SUM) algorithm that uses a reinduction strategy to track the recency of itemset occurrence and mine itemsets from incremental databases. The paper provides a novel approach for mining high utility itemsets from dynamic databases and presents several experiments that demonstrate the effectiveness of the proposed approach.","PeriodicalId":52220,"journal":{"name":"Data Science and Engineering","volume":"35 1","pages":"0"},"PeriodicalIF":4.6000,"publicationDate":"2023-09-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A Reinduction-Based Approach for Efficient High Utility Itemset Mining from Incremental Datasets\",\"authors\":\"Pushp Sra, Satish Chand\",\"doi\":\"10.1007/s41019-023-00229-4\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Abstract High utility itemset mining is a crucial research area that focuses on identifying combinations of itemsets from databases that possess a utility value higher than a user-specified threshold. However, most existing algorithms assume that the databases are static, which is not realistic for real-life datasets that are continuously growing with new data. Furthermore, existing algorithms only rely on the utility value to identify relevant itemsets, leading to even the earliest occurring combinations being produced as output. Although some mining algorithms adopt a support-based approach to account for itemset frequency, they do not consider the temporal nature of itemsets. To address these challenges, this paper proposes the Scented Utility Miner (SUM) algorithm that uses a reinduction strategy to track the recency of itemset occurrence and mine itemsets from incremental databases. The paper provides a novel approach for mining high utility itemsets from dynamic databases and presents several experiments that demonstrate the effectiveness of the proposed approach.\",\"PeriodicalId\":52220,\"journal\":{\"name\":\"Data Science and Engineering\",\"volume\":\"35 1\",\"pages\":\"0\"},\"PeriodicalIF\":4.6000,\"publicationDate\":\"2023-09-29\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Data Science and Engineering\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1007/s41019-023-00229-4\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Data Science and Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1007/s41019-023-00229-4","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 0

摘要

摘要高效用项集挖掘是一个重要的研究领域，它关注于从数据库中识别具有高于用户指定阈值的效用值的项集组合。然而，大多数现有算法假设数据库是静态的，这对于随着新数据不断增长的现实数据集来说是不现实的。此外，现有算法仅依赖效用值来识别相关的项集，导致即使是最早出现的组合也会作为输出产生。尽管一些挖掘算法采用基于支持的方法来考虑项目集的频率，但它们没有考虑项目集的时间性质。为了解决这些挑战，本文提出了气味效用矿工(SUM)算法，该算法使用重新归纳策略来跟踪项目集的出现频率，并从增量数据库中挖掘项目集。本文提出了一种从动态数据库中挖掘高效用项集的新方法，并通过几个实验证明了该方法的有效性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

A Reinduction-Based Approach for Efficient High Utility Itemset Mining from Incremental Datasets

Abstract High utility itemset mining is a crucial research area that focuses on identifying combinations of itemsets from databases that possess a utility value higher than a user-specified threshold. However, most existing algorithms assume that the databases are static, which is not realistic for real-life datasets that are continuously growing with new data. Furthermore, existing algorithms only rely on the utility value to identify relevant itemsets, leading to even the earliest occurring combinations being produced as output. Although some mining algorithms adopt a support-based approach to account for itemset frequency, they do not consider the temporal nature of itemsets. To address these challenges, this paper proposes the Scented Utility Miner (SUM) algorithm that uses a reinduction strategy to track the recency of itemset occurrence and mine itemsets from incremental databases. The paper provides a novel approach for mining high utility itemsets from dynamic databases and presents several experiments that demonstrate the effectiveness of the proposed approach.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Data Science and Engineering Engineering-Computational Mechanics

CiteScore

10.40

自引率

2.40%

发文量

审稿时长

12 weeks

期刊介绍： The journal of Data Science and Engineering (DSE) responds to the remarkable change in the focus of information technology development from CPU-intensive computation to data-intensive computation, where the effective application of data, especially big data, becomes vital. The emerging discipline data science and engineering, an interdisciplinary field integrating theories and methods from computer science, statistics, information science, and other fields, focuses on the foundations and engineering of efficient and effective techniques and systems for data collection and management, for data integration and correlation, for information and knowledge extraction from massive data sets, and for data use in different application domains. Focusing on the theoretical background and advanced engineering approaches, DSE aims to offer a prime forum for researchers, professionals, and industrial practitioners to share their knowledge in this rapidly growing area. It provides in-depth coverage of the latest advances in the closely related fields of data science and data engineering. More specifically, DSE covers four areas: (i) the data itself, i.e., the nature and quality of the data, especially big data; (ii) the principles of information extraction from data, especially big data; (iii) the theory behind data-intensive computing; and (iv) the techniques and systems used to analyze and manage big data. DSE welcomes papers that explore the above subjects. Specific topics include, but are not limited to: (a) the nature and quality of data, (b) the computational complexity of data-intensive computing,(c) new methods for the design and analysis of the algorithms for solving problems with big data input,(d) collection and integration of data collected from internet and sensing devises or sensor networks, (e) representation, modeling, and visualization of big data,(f) storage, transmission, and management of big data,(g) methods and algorithms of data intensive computing, such asmining big data,online analysis processing of big data,big data-based machine learning, big data based decision-making, statistical computation of big data, graph-theoretic computation of big data, linear algebraic computation of big data, and big data-based optimization. (h) hardware systems and software systems for data-intensive computing, (i) data security, privacy, and trust, and(j) novel applications of big data.