{"title":"R-Apriori: An Efficient Apriori based Algorithm on Spark","authors":"Sanjay Rathee, Manohar Kaul, Arti Kashyap","doi":"10.1145/2809890.2809893","DOIUrl":null,"url":null,"abstract":"Association rule mining remains a very popular and effective method to extract meaningful information from large datasets. It tries to find possible associations between items in large transaction based datasets. In order to create these associations, frequent patterns have to be generated. The \"Apriori\" algorithm along with its set of improved variants, which were one of the earliest proposed frequent pattern generation algorithms still remain a preferred choice due to their ease of implementation and natural tendency to be parallelized. While many efficient single-machine methods for Apriori exist, the massive amount of data available these days is far beyond the capacity of a single machine. Hence, there is a need to scale across multiple machines to meet the demands of this ever-growing data. MapReduce is a popular fault-tolerant framework for distributed applications. Nevertheless, heavy disk I/O at each MapReduce operation hinders the implementation of efficient iterative data mining algorithms, such as Apriori, on MapReduce platforms. A newly proposed in-memory distributed dataflow platform called Spark overcomes the disk I/O bottlenecks in MapReduce. Therefore, Spark presents an ideal platform for distributed Apriori. However, in the implementation of Apriori, the most computationally expensive task is the generation of candidate sets having all possible pairs for singleton frequent items and comparing each pair with every transaction record. Here, we propose a new approach which dramatically reduces this computational complexity by eliminating the candidate generation step and avoiding costly comparisons. We conduct in-depth experiments to gain insight into the effectiveness, efficiency and scalability of our approach. Our studies show that our approach outperforms the classical Apriori and state-of-the-art on Spark by many times for different datasets.","PeriodicalId":67056,"journal":{"name":"车间管理","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"71","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"车间管理","FirstCategoryId":"96","ListUrlMain":"https://doi.org/10.1145/2809890.2809893","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Association rule mining remains a very popular and effective method for extracting meaningful information from large datasets. It aims to find possible associations between items in large transaction-based datasets. In order to discover these associations, frequent patterns have to be generated. The "Apriori" algorithm, along with its set of improved variants, was one of the earliest proposed frequent pattern generation algorithms and still remains a preferred choice due to its ease of implementation and its natural tendency to be parallelized. While many efficient single-machine methods for Apriori exist, the massive amount of data available these days is far beyond the capacity of a single machine. Hence, there is a need to scale across multiple machines to meet the demands of this ever-growing data. MapReduce is a popular fault-tolerant framework for distributed applications. Nevertheless, heavy disk I/O at each MapReduce operation hinders the implementation of efficient iterative data mining algorithms, such as Apriori, on MapReduce platforms. A newly proposed in-memory distributed dataflow platform called Spark overcomes the disk I/O bottlenecks of MapReduce. Therefore, Spark presents an ideal platform for distributed Apriori. However, in the implementation of Apriori, the most computationally expensive task is the generation of candidate sets containing all possible pairs of singleton frequent items and the comparison of each candidate pair with every transaction record. Here, we propose a new approach which dramatically reduces this computational cost by eliminating the candidate generation step and avoiding costly comparisons. We conduct in-depth experiments to gain insight into the effectiveness, efficiency, and scalability of our approach. Our studies show that our approach outperforms classical Apriori and the state-of-the-art Apriori implementation on Spark by many times across different datasets.
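A minimal PySpark sketch of the idea the abstract describes, under stated assumptions: the dataset, the support threshold, and the function names are hypothetical, and this is an illustration of candidate-less pair counting rather than the authors' exact R-Apriori implementation. After the first pass finds frequent singletons, each transaction is pruned to its frequent items and only the pairs actually present in it are emitted and counted, so the explicit candidate-pair set and the per-candidate scan over all transactions are avoided.

from itertools import combinations
from pyspark import SparkContext

# Hypothetical sketch (not the authors' code): candidate-less second pass
# for frequent pairs on Spark.
sc = SparkContext(appName="pair-counting-sketch")
min_support = 3  # absolute support threshold (assumed value)

# Toy transaction data; in practice this would be loaded from a distributed store.
transactions = sc.parallelize([
    ["a", "b", "c"],
    ["a", "c", "d"],
    ["b", "c", "e"],
    ["a", "b", "c", "e"],
])

# Pass 1: classical Apriori first pass -- count singletons, keep the frequent ones.
singleton_counts = (transactions
                    .flatMap(lambda t: [(item, 1) for item in t])
                    .reduceByKey(lambda a, b: a + b)
                    .filter(lambda kv: kv[1] >= min_support))
frequent_items = set(singleton_counts.keys().collect())
frequent_bc = sc.broadcast(frequent_items)  # shipped once to every worker

# Pass 2: instead of generating all candidate pairs from the frequent singletons
# and testing each candidate against every transaction, prune each transaction
# to its frequent items and emit only the pairs that actually occur in it.
def pairs_in_transaction(t):
    kept = sorted(set(t) & frequent_bc.value)
    return [(pair, 1) for pair in combinations(kept, 2)]

frequent_pairs = (transactions
                  .flatMap(pairs_in_transaction)
                  .reduceByKey(lambda a, b: a + b)
                  .filter(lambda kv: kv[1] >= min_support))

print(frequent_pairs.collect())

The plain broadcast set of frequent singletons stands in here for whatever more compact membership structure a production implementation might use; the point of the sketch is only that no candidate-pair set is ever materialized and no candidate is compared against every transaction.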