Wei Shen, Y. Kamarianakis, L. Wynter, Jingrui He, Qing He, Richard D. Lawrence, G. Swirszcz
This report summarizes the methodologies and techniques we developed and applied to tackle Task 3 of the IEEE ICDM Contest, which concerns predicting traffic velocity from GPS data. The major components of our solution are (1) a pre-processing procedure that maps GPS readings to the road network, (2) a K-nearest-neighbor approach that identifies the most similar training hours for every test hour, and (3) a heuristic evaluation framework for optimizing parameters and avoiding over-fitting. Our solution finished second in the final evaluation.
Wei Shen, Y. Kamarianakis, L. Wynter, Jingrui He, Qing He, Richard D. Lawrence, and G. Swirszcz, "Traffic Velocity Prediction Using GPS Data: IEEE ICDM Contest Task 3 Report," 2010 IEEE International Conference on Data Mining Workshops, Dec. 2010. doi:10.1109/ICDMW.2010.52
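The KNN step can be illustrated with a minimal sketch. The feature representation below (per-segment mean velocities for each hour) and the Euclidean metric are illustrative assumptions, not details from the report:

```python
import numpy as np

def k_nearest_hours(train_hours, test_hour, k=3):
    """Return indices of the k training hours whose feature vectors
    are closest (Euclidean distance) to the test hour's vector.

    train_hours: (n_hours, n_features) array, e.g. mean velocity per
    road segment in each hour; test_hour: (n_features,) array.
    """
    dists = np.linalg.norm(train_hours - test_hour, axis=1)
    return np.argsort(dists)[:k]

# Toy example: 4 training hours described by 3 segment velocities.
train = np.array([[50.0, 48.0, 52.0],
                  [20.0, 25.0, 22.0],   # congested hour
                  [49.0, 50.0, 51.0],
                  [35.0, 30.0, 33.0]])
test = np.array([21.0, 24.0, 23.0])     # resembles the congested hour
print(k_nearest_hours(train, test, k=2))  # → [1 3]
```

Predictions for the test hour would then be formed from the velocities observed in the retrieved hours.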
In knowledge discovery in single sequences, different results can be discovered from the same sequence when different frequency measures are adopted. This naturally raises several questions: (1) do these frequency measures reflect actual frequencies accurately? (2) what impact do frequency measures have on the discovered knowledge? (3) are the discovered results accurate and reliable? and (4) which measures reflect frequencies accurately? In this paper, taking three major factors (anti-monotonicity, maximum frequency, and window-width restriction) into account, we identify inaccuracies inherent in seven existing frequency measures and investigate their impact on the soundness and completeness of two kinds of knowledge discovered from single sequences: frequent episodes and episode rules. To obtain more accurate frequencies and knowledge, we provide three recommendations for defining appropriate frequency measures and, following them, introduce a more appropriate frequency measure. Empirical evaluation reveals the inaccuracies and verifies our findings.
M. Gan and H. Dai, "A Study on the Accuracy of Frequency Measures and Its Impact on Knowledge Discovery in Single Sequences," 2010 IEEE International Conference on Data Mining Workshops, Dec. 2010. doi:10.1109/ICDMW.2010.83
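For concreteness, one classic window-based frequency measure for serial episodes (count the sliding windows that contain the episode as an ordered subsequence) can be sketched as follows; this is an existing measure of the kind the paper analyzes, not the new measure it introduces:

```python
def window_frequency(sequence, episode, width):
    """Count the sliding windows of the given width that contain the
    episode as an (ordered) subsequence -- a classic window-based
    frequency measure for serial episodes in a single sequence.
    """
    def contains(window, ep):
        it = iter(window)
        return all(e in it for e in ep)   # ordered-subsequence test
    n = len(sequence)
    return sum(contains(sequence[i:i + width], episode)
               for i in range(n - width + 1))

seq = list("abcabcab")
print(window_frequency(seq, ["a", "b"], 3))  # → 4
print(window_frequency(seq, ["a"], 3))       # → 6
```

Note that on this example the measure is anti-monotone: the sub-episode ["a"] is at least as frequent as ["a", "b"], which is one of the three factors the paper examines.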
Barnan Das, Chao Chen, N. Dasgupta, D. Cook, Adriana M. Seelye
With more older adults and people with cognitive disorders preferring to live independently at home, prompting systems that assist with Activities of Daily Living (ADLs) are in demand. In this paper, introducing “The PUCK”, we present a first approach to automating a prompting system without any predefined rule set or user feedback. We statistically analyze realistic prompting data and derive a classifier from statistical outlier detection methods. Further, we devise a sampling technique to handle skewed and sparse data sets. We empirically find a class distribution suitable for our task and validate our claims using three classical machine learning algorithms.
Barnan Das, Chao Chen, N. Dasgupta, D. Cook, and Adriana M. Seelye, "Automated Prompting in a Smart Home Environment," 2010 IEEE International Conference on Data Mining Workshops, Dec. 2010. doi:10.1109/ICDMW.2010.147
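A minimal stand-in for an outlier-detection-based classifier of the kind described: fit the majority ("no prompt") class and flag strong deviations as prompt candidates. The features and threshold below are hypothetical, not from the paper:

```python
import numpy as np

def zscore_outlier_classifier(majority, threshold=3.0):
    """Fit mean/std on the majority ('no prompt') class and flag any
    activity step whose features deviate strongly as an outlier
    ('prompt' candidate).  Threshold of 3 standard deviations is an
    illustrative choice.
    """
    mu = majority.mean(axis=0)
    sigma = majority.std(axis=0) + 1e-9        # avoid divide-by-zero
    def predict(x):
        z = np.abs((x - mu) / sigma)
        return int(z.max() > threshold)        # 1 = prompt candidate
    return predict

# Hypothetical 'normal activity' feature vectors (duration, steps)
normal = np.random.default_rng(0).normal(10.0, 1.0, size=(200, 2))
clf = zscore_outlier_classifier(normal)
print(clf(np.array([10.2, 9.8])), clf(np.array([30.0, 10.0])))  # → 0 1
```

This one-class framing avoids needing many labeled "prompt" examples, which is exactly the class-imbalance problem the paper's sampling technique targets.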
Jin Chang, Jun Luo, J. Huang, Shengzhong Feng, Jianping Fan
Rapid growth of data provides us with more information, yet challenges traditional techniques for extracting useful knowledge. In this paper, we propose MCMM, a Minimum spanning tree (MST) based Classification model for Massive data with a MapReduce implementation. It can be viewed as an intermediate model between the traditional K-nearest-neighbor method and cluster-based classification, aiming to overcome their disadvantages while coping with large amounts of data. Our model is implemented on the Hadoop platform, using its MapReduce programming framework, which is particularly well suited to cloud computing. We have run experiments on several data sets, including real-world data from the UCI repository and synthetic data, on a Downing 4000 cluster running Hadoop. The results show that our model generally outperforms KNN and some other classification methods with respect to accuracy and scalability.
Jin Chang, Jun Luo, J. Huang, Shengzhong Feng, and Jianping Fan, "Minimum Spanning Tree Based Classification Model for Massive Data with MapReduce Implementation," 2010 IEEE International Conference on Data Mining Workshops, Dec. 2010. doi:10.1109/ICDMW.2010.14
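The MST-plus-clustering idea underlying such a model can be sketched on a single machine. This is not the paper's exact MCMM algorithm, and the MapReduce partitioning is omitted; it shows only the standard trick of cutting the heaviest MST edges to form clusters:

```python
import numpy as np

def mst_clusters(points, n_clusters):
    """Build Kruskal's MST over pairwise Euclidean distances, then
    drop the n_clusters-1 heaviest MST edges to split the data into
    clusters (each point's label is its component root)."""
    n = len(points)
    edges = sorted((np.linalg.norm(points[i] - points[j]), i, j)
                   for i in range(n) for j in range(i + 1, n))
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    mst = []
    for w, i, j in edges:                   # Kruskal: grow the MST
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            mst.append((w, i, j))
    parent = list(range(n))                 # re-union, cutting the
    for w, i, j in sorted(mst)[: n - n_clusters]:   # heaviest edges
        parent[find(i)] = find(j)
    return [find(i) for i in range(n)]

pts = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11]])
labels = mst_clusters(pts, 2)
print(labels)  # first three points share a label, last two another
```

Classification would then assign a test point to the nearest cluster (e.g., by centroid), which is what places the method between KNN and cluster-based classification.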
Diagnosis Related Group (DRG) upcoding is an anomaly in healthcare data that costs hundreds of millions of dollars in many developed countries. DRG upcoding is typically detected through resource-intensive auditing. As supervised modeling of DRG upcoding is severely constrained by the scope and timeliness of past audit data, we propose an unsupervised algorithm to filter data for potential instances of DRG upcoding. The algorithm has been applied to a hip replacement/revision dataset and a heart-attack dataset; the results are consistent with the assumptions held by domain experts.
Wei Luo and M. Gallagher, "Unsupervised DRG Upcoding Detection in Healthcare Databases," 2010 IEEE International Conference on Data Mining Workshops, Dec. 2010. doi:10.1109/ICDMW.2010.108
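The paper's algorithm is not reproduced here, but one plausible unsupervised filter in this spirit flags, within each DRG, episodes whose resource use (here, length of stay) is unusually low for that group, since upcoding pairs a high-paying code with resource use typical of a cheaper one. The IQR rule and threshold are illustrative assumptions:

```python
import numpy as np

def drg_outlier_filter(drg_codes, lengths_of_stay, k=1.5):
    """Flag episodes whose length of stay is unusually LOW for their
    DRG, using a per-group interquartile-range rule (a hypothetical
    proxy for upcoding, for illustration only)."""
    flags = np.zeros(len(drg_codes), dtype=bool)
    for code in set(drg_codes):
        idx = [i for i, c in enumerate(drg_codes) if c == code]
        los = np.array([lengths_of_stay[i] for i in idx])
        q1, q3 = np.percentile(los, [25, 75])
        low = q1 - k * (q3 - q1)            # lower Tukey fence
        for i in idx:
            flags[i] = lengths_of_stay[i] < low
    return flags

codes = ["A"] * 7 + ["B"] * 3
los = [8, 9, 7, 8, 10, 9, 1, 3, 4, 3]   # the 1-day "A" stay is suspect
print(drg_outlier_filter(codes, los))
```

Flagged episodes would then be passed to auditors, narrowing the resource-intensive manual review the abstract describes.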
Item folksonomy, or tag information, is a typical and prevalent kind of Web 2.0 information. Item folksonomy contains rich information about users' opinions on item classifications and descriptions, and it can serve as another important information source for opinion mining. On the other hand, each item is associated with taxonomy information that reflects the viewpoints of experts. In this paper, we propose to mine users' opinions on items based on item taxonomy developed by experts and folksonomy contributed by users. In addition, we explore how to make personalized item recommendations based on users' opinions. Experiments conducted on real-world datasets collected from Amazon.com and CiteULike demonstrate the effectiveness of the proposed approaches.
Huizhi Liang, Yue Xu, and Yuefeng Li, "Mining Users' Opinions Based on Item Folksonomy and Taxonomy for Personalized Recommender Systems," 2010 IEEE International Conference on Data Mining Workshops, Dec. 2010. doi:10.1109/ICDMW.2010.163
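A building block such recommenders typically need is user similarity over folksonomy profiles. A minimal sketch, with hypothetical tag-count profiles (the paper's actual opinion-mining model is richer than this):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse tag-count dicts."""
    shared = set(u) & set(v)
    dot = sum(u[t] * v[t] for t in shared)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical user profiles built from folksonomy tags
alice = {"jazz": 3, "vinyl": 1}
bob   = {"jazz": 2, "vinyl": 2}
carol = {"ml": 4, "python": 1}
print(round(cosine(alice, bob), 3), cosine(alice, carol))  # → 0.894 0.0
```

Items favored by highly similar users (here, Bob for Alice) become recommendation candidates; the taxonomy side would add expert-defined categories to these profiles.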
During the early phase of drug discovery, machine learning methods are often used to select compounds to send for experimental screening. For this goal, any method that can estimate the error rate of a given set of predictions is an extremely valuable tool. In this paper we compare the Platt calibration algorithm and the recently introduced conformal prediction algorithm for controlling the error rate, in the sense of precision, while preserving the ability to identify as many compounds as possible (recall) that are highly likely to be bio-active in a certain context. We empirically evaluate and compare the performance of Platt calibration and the offline Mondrian inductive confidence machine (ICM) for SVM-based classification on 75 distinct classification problems. We perform this evaluation in the real-world setting where the true class labels of compounds are unknown at prediction time and are revealed only after the biological experiment is completed. Our empirical results show that, under this setting, neither the offline Mondrian ICM nor Platt calibration can bound precision rates well on an absolute basis. Comparatively, the Mondrian ICM, even though not theoretically designed to control precision directly, compares favorably with Platt calibration for this task.
Nikil Wale, "An Empirical Comparison of Platt Calibration and Inductive Confidence Machines for Predictions in Drug Discovery," 2010 IEEE International Conference on Data Mining Workshops, Dec. 2010. doi:10.1109/ICDMW.2010.111
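Platt calibration itself is easy to sketch: fit a sigmoid mapping classifier decision scores to probabilities. The version below uses plain gradient descent on the log-loss; Platt's original procedure uses Newton-style optimization and regularized targets, which are omitted here:

```python
import math

def platt_fit(scores, labels, lr=0.01, steps=5000):
    """Fit Platt's sigmoid P(y=1|s) = 1 / (1 + exp(A*s + B)) to
    decision scores by gradient descent on the log-loss."""
    A, B = 0.0, 0.0
    for _ in range(steps):
        gA = gB = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(A * s + B))
            gA += (p - y) * (-s)   # d(logloss)/dA
            gB += (p - y) * (-1)   # d(logloss)/dB
        A -= lr * gA
        B -= lr * gB
    return lambda s: 1.0 / (1.0 + math.exp(A * s + B))

# Separable toy scores: positives > 0, negatives < 0
prob = platt_fit([-2.0, -1.0, 1.0, 2.0], [0, 0, 1, 1])
print(prob(2.0) > 0.5, prob(-2.0) < 0.5)  # → True True
```

In an ICM, by contrast, held-out calibration scores yield per-prediction p-values rather than a fitted sigmoid, which is the design difference the comparison above probes.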
We propose a new approach to analyzing the Japanese government bond (JGB) market using text-mining technology. First, we extract feature vectors from the monthly reports of the Bank of Japan (BOJ). Then, trends in the JGB market are estimated by regression analysis on the feature vectors. In comparison with support vector regression and other methods, the proposed method forecast both the level and the direction of long-term market trends with higher accuracy. Moreover, our method achieved high average annual returns in the trading test.
K. Izumi, Takashi Goto, and Tohgoroh Matsui, "Trading Tests of Long-Term Market Forecast by Text Mining," 2010 IEEE International Conference on Data Mining Workshops, Dec. 2010. doi:10.1109/ICDMW.2010.60
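The extract-features-then-regress pipeline can be illustrated with a toy bag-of-words regression. The documents and yield changes below are invented, and the paper's actual BOJ-report feature extraction is not reproduced:

```python
import numpy as np

def bow_regression(docs, y):
    """Least-squares regression from bag-of-words counts (plus an
    intercept) to a market variable."""
    vocab = sorted({w for d in docs for w in d.split()})
    X = np.array([[d.split().count(w) for w in vocab] for d in docs],
                 dtype=float)
    X = np.hstack([X, np.ones((len(docs), 1))])      # intercept column
    coef, *_ = np.linalg.lstsq(X, np.array(y, float), rcond=None)
    def predict(doc):
        x = [doc.split().count(w) for w in vocab] + [1.0]
        return float(np.dot(x, coef))
    return predict

reports = ["economy weak deflation", "economy strong growth",
           "weak weak deflation", "strong growth growth"]
rates = [-0.2, 0.3, -0.4, 0.5]           # hypothetical yield changes
model = bow_regression(reports, rates)
print(round(model("economy strong growth"), 2))  # → 0.3
```

A trading test then converts such forecasts into positions (e.g., long when the predicted change is positive) and measures the realized return, which is how the abstract's "implementation test" is framed.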
Continuously increasing amounts of data in data warehouses provide companies with ample opportunity to conduct analytical customer relationship management (CRM). However, how to use the information retrieved from these data to retain the most valuable customers, identify customers with additional revenue potential, and achieve cost-effective customer relationship management continues to pose challenges for companies. This study proposes a two-level approach combining SOM-Ward clustering and predictive analytics to segment the customer base of a case company with 1.5 million customers. First, according to the customers' spending amount and demographic and behavioral characteristics, we adopt SOM-Ward clustering to divide the customer base into seven segments: exclusive customers, high-spending customers, and five segments of mass customers. Then, three classification models - a support vector machine (SVM), a neural network, and a decision tree - are employed to classify high-spending and low-spending customers; their performance is evaluated and compared. Finally, the three models are combined to predict potential high-spending customers among the mass customers. We find that this hybrid approach provides more thorough and detailed information about the customer base, especially the untapped mass market with potentially high revenue contribution, for tailoring actionable marketing strategies.
Zhiyuan Yao, T. Eklund, and B. Back, "Using SOM-Ward Clustering and Predictive Analytics for Conducting Customer Segmentation," 2010 IEEE International Conference on Data Mining Workshops, Dec. 2010. doi:10.1109/ICDMW.2010.121
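The final combination step can be as simple as a majority vote over the three trained models. The stand-in classifiers below are placeholders (any callable returning 0/1), not the paper's fitted SVM, neural network, and decision tree:

```python
def combined_vote(classifiers, x):
    """Majority vote of several trained binary classifiers: flag a
    customer as a potential high-spender when a strict majority of
    models say so."""
    votes = sum(clf(x) for clf in classifiers)
    return int(votes * 2 > len(classifiers))

# Hypothetical stand-ins for the three trained models
svm_like  = lambda x: int(x["annual_spend"] > 1000)
net_like  = lambda x: int(x["visits_per_month"] > 4)
tree_like = lambda x: int(x["annual_spend"] > 800 and x["visits_per_month"] > 2)

customer = {"annual_spend": 900, "visits_per_month": 6}
print(combined_vote([svm_like, net_like, tree_like], customer))  # → 1
```

Voting trades each model's individual bias for agreement among the three, which is one common rationale for combining heterogeneous classifiers in this way.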
Clique detection and analysis is one of the fundamental problems in graph theory. However, as graphs grow in size (e.g., those of social networks), such analysis becomes difficult for existing sequential algorithms due to computation and memory limitations. In this paper, we present a distributed algorithm, dMaximalCliques, which can obtain clique information from million-node graphs within a few minutes on an 80-node computer cluster. dMaximalCliques is a distributed algorithm for shared-nothing systems, such as racks of clusters. We use very large real and synthetic graphs in the experimental studies to demonstrate the efficiency of the algorithm. In addition, we propose the distribution of maximal clique sizes in a graph (the maximal clique distribution) as a new measure of a graph's structural properties and a means of distinguishing different types of graphs. We also find that this distribution is well fitted by a lognormal distribution.
Li Lu, Yunhong Gu, and R. Grossman, "dMaximalCliques: A Distributed Algorithm for Enumerating All Maximal Cliques and Maximal Clique Distribution," 2010 IEEE International Conference on Data Mining Workshops, Dec. 2010. doi:10.1109/ICDMW.2010.13
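As a reference point for what each worker must compute, the classic sequential Bron-Kerbosch enumeration of maximal cliques (without the paper's distributed partitioning) is:

```python
def bron_kerbosch(R, P, X, adj, out):
    """Enumerate all maximal cliques of a graph given as an adjacency
    dict of sets.  R: current clique; P: candidates that extend R;
    X: vertices already processed (prevents non-maximal output)."""
    if not P and not X:
        out.append(sorted(R))   # R cannot be extended: maximal clique
        return
    for v in list(P):
        bron_kerbosch(R | {v}, P & adj[v], X & adj[v], adj, out)
        P = P - {v}
        X = X | {v}

# Triangle 1-2-3 plus a pendant edge 3-4
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
cliques = []
bron_kerbosch(set(), set(adj), set(), adj, cliques)
print(sorted(cliques))  # → [[1, 2, 3], [3, 4]]
```

The maximal clique distribution the paper proposes is then just the histogram of clique sizes, e.g. `Counter(len(c) for c in cliques)`.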