{"title":"Infrequent Purchased Product Recommendation Making Based on User Behaviour and Opinions in E-commerce Sites","authors":"N. Abdullah, Yue Xu, S. Geva, Jinghong Chen","doi":"10.1109/ICDMW.2010.116","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.116","url":null,"abstract":"Web-based commercial recommender systems (RS) can help users decide which product to purchase from the vast number of products available on the Internet. Currently, many commercial recommender systems are developed for recommending frequently purchased products, for which a large amount of explicit ratings or purchase history data is available to predict user preferences. However, for products that are infrequently purchased, it is difficult to collect such data, and user profiling therefore becomes a major challenge in recommending these kinds of products. This paper proposes a recommendation approach for infrequently purchased products based on user opinions and navigation data. User opinion data, collected from product reviews, is used to generate product profiles, and user navigation data is used to generate user profiles; both are used to recommend products that best satisfy users’ needs. Experiments conducted on real e-commerce data show that the proposed approach, named Adaptive Collaborative Filtering (ACF), which utilizes both user and product profiles, outperforms the Query Expansion (QE) approach, which utilizes only product profiles. ACF also performs better than the Basic Search (BS) approach, which is widely applied in current e-commerce applications.","PeriodicalId":170201,"journal":{"name":"2010 IEEE International Conference on Data Mining Workshops","volume":"1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126027959","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
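The abstract does not specify ACF's matching function between the navigation-derived user profile and the opinion-derived product profiles. A common choice for this kind of profile matching, assumed here purely for illustration, is cosine similarity over keyword weights:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two keyword-weight profiles (dicts)."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = sqrt(sum(w * w for w in u.values()))
    nv = sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def recommend(user_profile, product_profiles, top_k=2):
    """Rank products by how well their profiles match the user profile."""
    ranked = sorted(product_profiles,
                    key=lambda p: -cosine(user_profile, product_profiles[p]))
    return ranked[:top_k]

# Hypothetical profiles: user cares about battery and camera.
user = {"battery": 1.0, "camera": 0.5}
products = {"p1": {"battery": 1.0}, "p2": {"screen": 1.0}, "p3": {"camera": 1.0}}
print(recommend(user, products))  # ['p1', 'p3']
```

The profile names and weights above are invented; the paper's actual profile construction from reviews and navigation logs is richer than a flat keyword-weight dictionary.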
{"title":"Ant Colony Optimization Algorithm Based on Immunity Vaccine and Dynamic Pheromone Updating","authors":"Wanjun Liu, Juan Zhang, Junli Liu","doi":"10.1109/ICDMW.2010.16","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.16","url":null,"abstract":"To address the complexity and uncertainty of logistics and distribution systems, this paper puts forward an Ant Colony Optimization algorithm based on an immunity vaccine and dynamic pheromone updating. In the new algorithm, the initial antibody is first vaccinated to produce better solutions, and the ant colony then sets its initial parameters based on these better solutions. A mechanism of dynamic adjustment and pheromone updating is also introduced to slow the growth of the pheromone concentration difference between paths, preventing the algorithm from settling on partial solutions. Experimental results show that DACOIV can effectively find optimal paths for logistics and distribution systems.","PeriodicalId":170201,"journal":{"name":"2010 IEEE International Conference on Data Mining Workshops","volume":"120 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127290604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
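The core of any such scheme is the pheromone update step. The sketch below is a generic evaporation-and-deposit rule with the deposit clamped to a [tau_min, tau_max] band (one simple way to keep pheromone differences between paths from growing too fast); it is a stand-in under those assumptions, not the paper's DACOIV rule, and all parameter names are invented:

```python
def update_pheromone(tau, best_path, rho=0.1, q=1.0, tau_min=0.1, tau_max=5.0):
    """One evaporation-and-deposit step over an edge->pheromone dict.
    Clamping to [tau_min, tau_max] slows the growth of pheromone
    concentration differences between paths."""
    for edge in tau:
        tau[edge] *= (1.0 - rho)                 # evaporation on every edge
    length = len(best_path) - 1                  # toy path cost: hop count
    for edge in zip(best_path, best_path[1:]):   # deposit on the best path
        tau[edge] = min(tau_max, max(tau_min, tau[edge] + q / length))
    return tau

tau = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): 1.0}
update_pheromone(tau, [0, 1, 2])
print(tau[(0, 1)], tau[(0, 2)])  # 1.4 0.9
```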
{"title":"Network Anomaly Detection Using a Commute Distance Based Approach","authors":"N. Khoa, T. Babaie, S. Chawla, Z. Zaidi","doi":"10.1109/ICDMW.2010.90","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.90","url":null,"abstract":"We propose the use of commute distance, a random walk metric, to discover anomalies in network traffic data. The commute distance based anomaly detection approach has several advantages over Principal Component Analysis (PCA), the method of choice for this task: (i) it generalizes both distance-based and density-based anomaly detection techniques, while PCA is primarily distance-based; (ii) it is agnostic about the underlying data distribution, while PCA assumes that the data follow a Gaussian distribution; and (iii) it is more robust than PCA, i.e., a perturbation of the underlying data or a change in the parameters used has a less significant effect on its output than on that of PCA. Experiments and analysis on simulated and real datasets validate our claims.","PeriodicalId":170201,"journal":{"name":"2010 IEEE International Conference on Data Mining Workshops","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121925590","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
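Commute distance is usually computed from the pseudoinverse of the graph Laplacian, but it is equivalently the expected round-trip time of a random walk between two nodes. The minimal sketch below (not the authors' implementation) computes it on a small graph by fixed-point iteration on the hitting-time equations h(i) = 1 + mean over neighbors of h:

```python
def hitting_times(adj, target, tol=1e-9):
    """Expected number of steps for a random walk to first reach `target`,
    found by fixed-point iteration; adj maps node -> list of neighbors."""
    h = {v: 0.0 for v in adj}  # h[target] stays 0
    while True:
        delta = 0.0
        for v in adj:
            if v == target:
                continue
            new = 1.0 + sum(h[u] for u in adj[v]) / len(adj[v])
            delta = max(delta, abs(new - h[v]))
            h[v] = new
        if delta < tol:
            return h

def commute_distance(adj, i, j):
    """Commute distance: expected round-trip time i -> j -> i."""
    return hitting_times(adj, j)[i] + hitting_times(adj, i)[j]

# Path graph 0 - 1 - 2: for any edge of a tree the commute distance is 2|E|.
path = {0: [1], 1: [0, 2], 2: [1]}
print(round(commute_distance(path, 0, 1)))  # 4  (= 2 * 2 edges)
```

For large graphs one would use the Laplacian pseudoinverse or an approximation rather than this per-pair iteration, which is only meant to make the metric concrete.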
{"title":"Interesting Subset Discovery and Its Application on Service Processes","authors":"M. Natu, Girish Keshav Palshikar","doi":"10.1109/ICDMW.2010.98","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.98","url":null,"abstract":"Various real-life datasets can be viewed as a set of records consisting of attributes describing the records and a set of measures evaluating them. In this paper, we address the problem of automatically discovering interesting subsets of such a dataset, subsets whose performance characteristics differ significantly from those of the rest of the dataset. We present an algorithm to discover such interesting subsets. The proposed algorithm uses a generic, domain-independent definition of interestingness and various heuristics to intelligently prune the search space, yielding a solution that scales to large datasets. This paper presents the application of the interesting subset discovery algorithm to four real-world case studies and demonstrates its effectiveness in extracting insights that identify problem areas and provide improvement recommendations for a wide variety of systems.","PeriodicalId":170201,"journal":{"name":"2010 IEEE International Conference on Data Mining Workshops","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122669075","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
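The abstract does not give the interestingness definition or the pruning heuristics, so the toy sketch below assumes the simplest possible stand-ins: the score of a subset is the absolute gap between its mean measure and the mean of the remaining records, and candidate subsets are limited to single attribute=value predicates (which the paper's heuristics would prune far more cleverly):

```python
from statistics import mean

def interestingness(measures, subset_idx):
    """|mean(subset) - mean(rest)| -- a minimal, assumed stand-in for the
    paper's domain-independent interestingness measure."""
    inside = [measures[i] for i in subset_idx]
    rest = [m for i, m in enumerate(measures) if i not in subset_idx]
    if not inside or not rest:
        return 0.0
    return abs(mean(inside) - mean(rest))

def best_attribute_subset(records, measures):
    """Scan all subsets defined by one attribute=value predicate and
    return (score, predicate) for the most interesting one."""
    best = (0.0, None)
    for attr in records[0]:
        for val in {r[attr] for r in records}:
            idx = {i for i, r in enumerate(records) if r[attr] == val}
            score = interestingness(measures, idx)
            if score > best[0]:
                best = (score, (attr, val))
    return best

# Hypothetical service-process data: response times per server.
records = [{"server": "A"}, {"server": "A"}, {"server": "B"}, {"server": "C"}]
measures = [10, 12, 40, 14]
print(best_attribute_subset(records, measures))  # server B stands out
```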
{"title":"Differential Analysis on Deep Web Data Sources","authors":"Tantan Liu, Fan Wang, Jiedan Zhu, G. Agrawal","doi":"10.1109/ICDMW.2010.22","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.22","url":null,"abstract":"The growing use of the Internet in everyday life has been creating new challenges and opportunities for data mining techniques. A relatively new trend on the Internet is the deep web. As a large number of deep web data sources tend to provide similar data, an important problem is to perform offline analysis to understand the differences in the data available from different sources. This paper introduces data mining methods to extract a high-level summary of the differences in data provided by different deep web data sources. We consider patterns of values with respect to the same entity and formulate a new data mining problem, which we refer to as differential rule mining. We have developed an algorithm for mining such rules. Our method includes a pruning step to summarize the identified differential rules, and for efficiency a hash table is used to accelerate the pruning process. We show the effectiveness, efficiency, and utility of our methods by analyzing data across four travel-related websites.","PeriodicalId":170201,"journal":{"name":"2010 IEEE International Conference on Data Mining Workshops","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131347216","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
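The exact form of a differential rule is left open in the abstract. One toy form, assumed here for illustration, compares the same entities across two sources and counts per-attribute value relationships, so that a summary like "price: source A lower than source B for all shared entities" can be read off the counts:

```python
from collections import Counter

def differential_patterns(source_a, source_b, key="entity"):
    """For entities present in both sources, count per attribute how the
    two sources' values relate (a<b, a>b, a=b)."""
    a_by_key = {r[key]: r for r in source_a}
    stats = {}
    for rb in source_b:
        ra = a_by_key.get(rb[key])
        if ra is None:
            continue  # entity only listed by source B
        for attr in ra:
            if attr == key or attr not in rb:
                continue
            rel = ("a<b" if ra[attr] < rb[attr]
                   else "a>b" if ra[attr] > rb[attr] else "a=b")
            stats.setdefault(attr, Counter())[rel] += 1
    return stats

# Hypothetical fares for the same routes on two travel sites.
src_a = [{"entity": "NYC-LAX", "price": 300}, {"entity": "SFO-SEA", "price": 120}]
src_b = [{"entity": "NYC-LAX", "price": 320}, {"entity": "SFO-SEA", "price": 130}]
print(differential_patterns(src_a, src_b)["price"])  # a<b for both routes
```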
{"title":"FDCluster: Mining Frequent Closed Discriminative Bicluster without Candidate Maintenance in Multiple Microarray Datasets","authors":"Miao Wang, Xuequn Shang, Shaohua Zhang, Zhanhuai Li","doi":"10.1109/ICDMW.2010.10","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.10","url":null,"abstract":"Biclustering is a methodology that allows genes and conditions to be clustered simultaneously. Almost all current biclustering algorithms find biclusters in a single microarray dataset. To reduce the influence of noise and find more biologically meaningful biclusters, we propose an algorithm, FDCluster, to mine frequent closed discriminative biclusters in multiple microarray datasets. FDCluster uses the Apriori property and several novel pruning techniques to mine frequent closed biclusters without candidate maintenance. The experimental results show that FDCluster is more effective than traditional methods on both single and multiple microarray datasets. We also test biological significance using GO to show that our proposed method is able to produce biologically relevant biclusters.","PeriodicalId":170201,"journal":{"name":"2010 IEEE International Conference on Data Mining Workshops","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125784561","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluation of Protein Backbone Alphabets: Using Predicted Local Structure for Fold Recognition","authors":"Kyong Jin Shim","doi":"10.1109/ICDMW.2010.168","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.168","url":null,"abstract":"Optimally combining available information is one of the key challenges in knowledge-driven prediction techniques. In this study, we evaluate six Phi- and Psi-based backbone alphabets. We show that adding predicted backbone conformations to SVM classifiers can improve fold recognition. Our experimental results show that including predicted backbone conformations in our feature representation leads to higher overall accuracy than using amino acid residues alone.","PeriodicalId":170201,"journal":{"name":"2010 IEEE International Conference on Data Mining Workshops","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129902364","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Additive Models via the Generalized Lasso","authors":"D. Semenovich, Nobuyuki Morioka, A. Sowmya","doi":"10.1109/ICDMW.2010.184","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.184","url":null,"abstract":"We propose a framework for learning generalized additive models at very little additional cost (a small constant factor) compared to some of the most efficient schemes for learning linear classifiers, such as linear SVMs and regularized logistic regression. We achieve this through a simple feature encoding scheme followed by a novel approach to regularization which we term ``generalized lasso''. Additive models offer an attractive alternative to linear models for many large-scale tasks, as they have significantly higher predictive power while remaining easily interpretable. Furthermore, our regularization approach extends to arbitrary graphs, allowing, for example, spatial information or similar priors to be incorporated explicitly. Traditional approaches for learning additive models, such as backfitting, do not scale to large datasets. Our new formulation of the resulting optimization problem allows us to use recent accelerated gradient algorithms and achieve speed comparable to state-of-the-art linear SVM training methods, making additive models suitable for very large problems. In our experiments we find that additive models consistently outperform linear models on various datasets.","PeriodicalId":170201,"journal":{"name":"2010 IEEE International Conference on Data Mining Workshops","volume":"220 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129957210","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
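The abstract does not spell out the feature encoding. One standard way, assumed here, to make a linear classifier additive is to one-hot encode each scalar feature into interval bins, so the learned weights trace out a piecewise-constant component function of that feature; differences between neighboring-bin weights are then the natural target of a generalized-lasso-style smoothness penalty:

```python
def bin_encode(x, edges):
    """One-hot encode scalar x into len(edges)+1 interval indicators,
    given sorted bin edges.  A linear model over these indicators
    realizes a piecewise-constant additive component f(x)."""
    k = sum(1 for e in edges if x >= e)          # index of the bin x falls in
    return [1.0 if i == k else 0.0 for i in range(len(edges) + 1)]

def encode_row(row, all_edges):
    """Concatenate per-feature encodings into one feature vector."""
    out = []
    for x, edges in zip(row, all_edges):
        out.extend(bin_encode(x, edges))
    return out

print(bin_encode(0.5, [0.0, 1.0]))  # [0.0, 1.0, 0.0]
```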
{"title":"Data Analysis in Los Angeles Long Beach with Seasonal Time Series Model","authors":"Weiqiang Wang, Zhendong Niu","doi":"10.1109/ICDMW.2010.93","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.93","url":null,"abstract":"Air pollution has long been a serious problem and is attracting increasing research attention. In this paper we present data analysis methods for the Los Angeles Long Beach dataset based on a seasonal ARIMA (autoregressive integrated moving average) model and the MCMC (Markov chain Monte Carlo) method. The MCMC methods are studied on Los Angeles Long Beach PM 2.5 air pollution observations from 1997 to 2008. The experimental results indicate that the seasonal ARIMA model can be an effective way to forecast air pollution and that the MCMC model fits the dataset very well. This approach applies to a large class of utility functions and models in the air pollution and traffic fields.","PeriodicalId":170201,"journal":{"name":"2010 IEEE International Conference on Data Mining Workshops","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130705725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
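The "seasonal" part of a seasonal ARIMA model rests on seasonal differencing, y'_t = y_t - y_{t-s}, which removes a repeating seasonal level before the ARMA terms are fit. A minimal illustration with made-up numbers (not the paper's PM 2.5 data):

```python
def seasonal_difference(series, period):
    """Seasonal differencing, the 'D' step of a seasonal ARIMA model:
    y'_t = y_t - y_{t-period}."""
    return [y - series[t - period] for t, y in enumerate(series) if t >= period]

# A toy series with season length 4 and a trend of +1 per season:
print(seasonal_difference([1, 2, 3, 4, 2, 3, 4, 5], 4))  # [1, 1, 1, 1]
```

After differencing, the constant residual shows the seasonal pattern has been removed; a real analysis would hand the differenced series to an ARMA fit (e.g. via maximum likelihood or, as in the paper, MCMC).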
{"title":"Sentence-Level and Document-Level Sentiment Mining for Arabic Texts","authors":"N. Farra, Elie Challita, R. A. Assi, Hazem M. Hajj","doi":"10.1109/ICDMW.2010.95","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.95","url":null,"abstract":"In this work, we investigate sentiment mining of Arabic text at both the sentence level and the document level. Existing research in Arabic sentiment mining remains very limited. For sentence-level classification, we investigate two approaches. The first is a novel grammatical approach that employs a general structure for the Arabic sentence. The second is based on the semantic orientation of words and their corresponding frequencies; to support it, we built an interactive learning semantic dictionary which stores the polarities of the roots of different words and identifies new polarities based on these roots. For document-level classification, we use sentences of known classes to classify whole documents, using a novel approach whereby documents are divided dynamically into chunks and classification is based on the semantic contributions of the different chunks. This dynamic chunking approach can also be investigated for sentiment mining in other languages. Finally, we propose a hierarchical classification scheme that uses the results of the sentence-level classifier as input to the document-level classifier, an approach which has not previously been investigated for Arabic documents. We also pinpoint the various challenges faced by sentiment mining for Arabic texts and propose suggestions for its development. We demonstrate promising results with our sentence-level approach, and our document-level experiments show, with high accuracy, that it is feasible to extract the sentiment of an Arabic document based on the classes of its sentences.","PeriodicalId":170201,"journal":{"name":"2010 IEEE International Conference on Data Mining Workshops","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130709289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
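As a rough illustration of the chunking idea (the paper's dynamic chunking and weighting of semantic contributions are more elaborate than this), one can split a document's sentence polarities into equal-sized chunks and average per-chunk sentiment, so each region of the document contributes equally regardless of its sentence count:

```python
def classify_document(sentence_polarities, n_chunks=3):
    """Split sentence polarities (+1/-1) into n_chunks pieces and average
    per-chunk sentiment, weighting chunks equally -- a toy stand-in for
    chunk-based document-level classification."""
    n = len(sentence_polarities)
    size = max(1, -(-n // n_chunks))  # ceiling division
    chunks = [sentence_polarities[i:i + size] for i in range(0, n, size)]
    score = sum(sum(c) / len(c) for c in chunks) / len(chunks)
    return "positive" if score >= 0 else "negative"

print(classify_document([1, 1, -1, 1, 1, -1]))  # positive
```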