We propose a new approach for analyzing the Japanese government bond (JGB) market using text mining. First, we extract feature vectors from the monthly reports of the Bank of Japan (BOJ). Then, we estimate trends in the JGB market by regression analysis on these feature vectors. In a comparison with support vector regression and other methods, the proposed method forecast both the level and the direction of long-term market trends more accurately. Moreover, in a trading test, our method achieved high returns in terms of average annual rate.
{"title":"Trading Tests of Long-Term Market Forecast by Text Mining","authors":"K. Izumi, Takashi Goto, Tohgoroh Matsui","doi":"10.1109/ICDMW.2010.60","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.60","url":null,"abstract":"We propose a new approach for analyzing the Japanese government bond (JGB) market using text-mining technology. First, we extracted the feature vectors of the monthly reports from the Bank of Japan (BOJ). Then, the trends in the JGB market were estimated by a regression analysis using the feature vectors. As a result of comparison with support vector regression and other methods, the proposal method could forecast in higher accuracy about both the level and direction of long-term market trends. Moreover, our method showed high returns with annual rate averages as a result of the implementation test.","PeriodicalId":170201,"journal":{"name":"2010 IEEE International Conference on Data Mining Workshops","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133771147","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recently, multivariate temporal data classification has been widely applied in many fields, such as bio-signal analysis, stock prediction and weather forecasting. Multivariate temporal data contain attributes of mixed types, both numeric and categorical. However, most classification methods proposed in past research are not directly applicable to multivariate temporal data with multiple attribute types. Additionally, existing methods do not provide useful, readable rules for advanced classification analysis. In this paper, we propose a novel rule-based algorithm named Progressive Temporal Class Rule Miner (PTCR-Miner) for classification of multivariate temporal data. The classification rules discovered by our algorithm follow the purification concept we define, which makes them comprehensible and intuitive for general users. A series of experiments was conducted to evaluate our method with a multivariate temporal data simulator. The experimental results show that PTCR-Miner performs effectively and efficiently on different simulated multivariate temporal datasets. Additionally, a real dataset related to asthma monitoring was tested, and the results show that our classification mechanism works stably for asthma attack prediction. This indicates that the discovered rules are helpful and comprehensible for data classification. Furthermore, its rule-based and flexible architecture makes PTCR-Miner applicable to different application areas of multivariate temporal data classification.
{"title":"PTCR-Miner: Progressive Temporal Class Rule Mining for Multivariate Temporal Data Classification","authors":"Chao-Hui Lee, V. Tseng","doi":"10.1109/ICDMW.2010.171","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.171","url":null,"abstract":"Recently, multivariate temporal data classification has been widely applied on many fields, such as bio-signals analysis, stocks prediction and weather forecasting. Multivariate temporal data contains hybrid type of attributes like numeric and categorical ones. However, most classification methods proposed in the past researches are not directly applicable to the multivariate temporal data with multiple types. Additionally, no useful and readable rules are provided in the existing methods for advanced classification analysis. In this paper, we proposed a novel algorithm named Progressive Temporal Class Rule Miner (PTCR-Miner) for classification on multivariate temporal data with a rule-based design. Through our algorithm, the classification rules discovered follow the purification concept we defined to be comprehensible and intuitive for general users to use on data classification. A series of experiments were conducted to evaluate our method with a multivariate temporal data simulator. The experimental results showed that PTCR-Miner performs effectively and efficiently on different simulated multivariate temporal datasets. Additionally, a real dataset related to asthma monitoring was also tested and the results showed that our classification mechanism works stably for asthma attack predictions. This means the discovered rules are really helpful and comprehensible for data classification. 
Furthermore, the rule-based and flexible architecture make PTCR-Miner more applicable to different application areas of multivariate temporal data classification.","PeriodicalId":170201,"journal":{"name":"2010 IEEE International Conference on Data Mining Workshops","volume":"76 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114147515","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Diagnosis Related Group (DRG) upcoding is an anomaly in healthcare data that costs hundreds of millions of dollars in many developed countries. DRG upcoding is typically detected through resource-intensive auditing. Because supervised modeling of DRG upcoding is severely constrained by the scope and timeliness of past audit data, we propose in this paper an unsupervised algorithm that filters data for potential identification of DRG upcoding. The algorithm has been applied to a hip replacement/revision dataset and a heart-attack dataset. The results are consistent with the assumptions held by domain experts.
{"title":"Unsupervised DRG Upcoding Detection in Healthcare Databases","authors":"Wei Luo, M. Gallagher","doi":"10.1109/ICDMW.2010.108","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.108","url":null,"abstract":"Diagnosis Related Group (DRG) upcoding is an anomaly in healthcare data that costs hundreds of millions of dollars in many developed countries. DRG upcoding is typically detected through resource intensive auditing. As supervised modeling of DRG upcoding is severely constrained by scope and timeliness of past audit data, we propose in this paper an unsupervised algorithm to filter data for potential identification of DRG upcoding. The algorithm has been applied to a hip replacement/revision dataset and a heart-attack dataset. The results are consistent with the assumptions held by domain experts.","PeriodicalId":170201,"journal":{"name":"2010 IEEE International Conference on Data Mining Workshops","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121802116","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
During the early phase of drug discovery, machine learning methods are often used to select compounds to send for experimental screening. For this purpose, any method that can provide estimates of the error rate for a given set of predictions is an extremely valuable tool. In this paper we compare the Platt calibration algorithm and the recently introduced conformal algorithm for controlling the error rate in the sense of precision while preserving the ability to identify as many compounds as possible (recall) that are highly likely to be bio-active in a given context. We empirically evaluate and compare the performance of Platt calibration and the offline Mondrian inductive confidence machine (ICM) in the context of SVM-based classification on 75 distinct classification problems. We perform this evaluation in the real-world setting where the true class labels of compounds are unknown at the time of prediction and are revealed only after the biological experiment is completed. Our empirical results show that, in this setting, neither offline Mondrian ICM nor Platt calibration is able to bound precision rates very well on an absolute basis. Comparatively, Mondrian ICM, even though it is not theoretically designed to control precision directly, compares favorably with Platt calibration for this task.
{"title":"An Empirical Comparison of Platt Calibration and Inductive Confidence Machines for Predictions in Drug Discovery","authors":"Nikil Wale","doi":"10.1109/ICDMW.2010.111","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.111","url":null,"abstract":"During the early phase of drug discovery, machine learning methods are often utilized to select compounds to send for experimental screening. In order to accomplish this goal, any method that can provide estimates of error rate for a given set of predictions is an extremely valuable tool. In this paper we compare Platt Calibration Algorithm and recently introduced Conformal Algorithm to control the error rate in the sense of precision while preserving the ability to identify as many compounds as possible (recall) that are highly likely to be bio-active in a certain context. We empirically evaluate and compare the performance of Platt’s Calibration and offline Mondrian ICM in the context of SVM-based classification on 75 distinct classification problems. We perform this evaluation in the real world setting where the true class labels of compounds are unknown at the time of prediction and are only revealed after the biological experiment is completed. Our empirical results show that under this setting, offline Mondrian ICM and Platt Calibration are not able to bound precision rates very well on an absolute basis. 
Comparatively, Mondrian ICM, even though not theoretically designed to control precision directly, compares favorably with Platt Calibration for this task.","PeriodicalId":170201,"journal":{"name":"2010 IEEE International Conference on Data Mining Workshops","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116400542","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
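As a rough illustration of the class-conditional (Mondrian) inductive conformal idea mentioned in the abstract above, the following minimal sketch computes one p-value per candidate label from a held-out calibration set. It is not the author's implementation: the function names, the example labels, and the choice of nonconformity score (for instance, a negated SVM decision value) are assumptions made here for illustration only.

```python
from collections import defaultdict


def mondrian_p_values(cal_scores, cal_labels, test_scores_by_label):
    """Class-conditional (Mondrian) inductive conformal p-values.

    cal_scores: nonconformity scores of the calibration examples
        (hypothetical; e.g. a negated classifier margin).
    cal_labels: true class labels of the calibration examples.
    test_scores_by_label: {label: nonconformity score of the test
        example under the hypothesis that it has this label}.
    """
    # Group calibration scores by their true class.
    by_class = defaultdict(list)
    for score, label in zip(cal_scores, cal_labels):
        by_class[label].append(score)

    p_values = {}
    for label, test_score in test_scores_by_label.items():
        scores = by_class[label]
        # Calibration examples of this class that are at least as
        # nonconforming as the test example; the +1 terms count the
        # test example itself.
        at_least = sum(1 for s in scores if s >= test_score)
        p_values[label] = (at_least + 1) / (len(scores) + 1)
    return p_values


def prediction_set(p_values, significance):
    """Labels that cannot be rejected at the given significance level."""
    return {label for label, p in p_values.items() if p > significance}


if __name__ == "__main__":
    p = mondrian_p_values(
        [0.1, 0.4, 0.8, 0.2, 0.9],
        ["active", "active", "active", "inactive", "inactive"],
        {"active": 0.3, "inactive": 0.95},
    )
    print(p["active"])             # 0.75
    print(prediction_set(p, 0.5))  # {'active'}
```

Because each class is calibrated only against its own calibration examples, the error-rate guarantee holds per class. That is a different quantity from precision, which is in line with the abstract's remark that the method is not designed to control precision directly.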
An efficient incremental approach to the discriminative common vector (DCV) method for dimensionality reduction and classification is presented. Starting from the original batch method, an incremental formulation is given. The main idea is to minimize both matrix operations and space requirements. To this end, a straightforward per-sample correction is obtained, which makes it possible to set up an efficient online algorithm. The performance and the good properties of the original method are preserved, with a very significant decrease in computational burden when used in dynamic contexts. Extensive experiments have been carried out on several publicly available high-dimensional databases to assess the properties of the proposed algorithms with regard to previously proposed ones.
{"title":"Efficient Dimensionality Reduction on Undersampled Problems through Incremental Discriminative Common Vectors","authors":"F. Ferri, Katerine Díaz-Chito, W. D. Villanueva","doi":"10.1109/ICDMW.2010.50","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.50","url":null,"abstract":"An efficient incremental approach to the discriminative common vector (DCV) method for dimensionality reduction and classification is presented. Starting from the original batch method, an incremental formulation is given. The main idea is to minimize both matrix operations and space constraints. To this end, an straightforward per sample correction is obtained enabling the possibility of setting up an efficient online algorithm. The performance results and the same good properties than the original method are preserved but with a very significant decrease in computational burden when used in dynamic contexts. Extensive experimentation assessing the properties of the proposed algorithms with regard to previously proposed ones using several publicly available high dimensional databases has been carried out.","PeriodicalId":170201,"journal":{"name":"2010 IEEE International Conference on Data Mining Workshops","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116855655","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Item folksonomy, or tag information, is a typical and prevalent kind of Web 2.0 information. Item folksonomy contains rich information about users' opinions on item classifications and descriptions. It can be used as another important information source for opinion mining. On the other hand, each item is associated with taxonomy information that reflects the viewpoints of experts. In this paper, we propose to mine users' opinions on items based on the item taxonomy developed by experts and the folksonomy contributed by users. In addition, we explore how to make personalized item recommendations based on users' opinions. Experiments conducted on real-world datasets collected from Amazon.com and CiteULike demonstrate the effectiveness of the proposed approaches.
{"title":"Mining Users' Opinions Based on Item Folksonomy and Taxonomy for Personalized Recommender Systems","authors":"Huizhi Liang, Yue Xu, Yuefeng Li","doi":"10.1109/ICDMW.2010.163","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.163","url":null,"abstract":"Item folksonomy or tag information is a kind of typical and prevalent web 2.0 information. Item folksonomy contains rich opinion information of users on item classifications and descriptions. It can be used as another important information source to conduct opinion mining. On the other hand, each item is associated with taxonomy information that reflects the viewpoints of experts. In this paper, we propose to mine for users' opinions on items based on item taxonomy developed by experts and folksonomy contributed by users. In addition, we explore how to make personalized item recommendations based on users' opinions. The experiments conducted on real word datasets collected from Amazon.com and CiteULike demonstrated the effectiveness of the proposed approaches.","PeriodicalId":170201,"journal":{"name":"2010 IEEE International Conference on Data Mining Workshops","volume":"112 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121174167","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In knowledge discovery in single sequences, different results could be discovered from the same sequence when different frequency measures are adopted. It is natural to raise such questions as (1) do these frequency measures reflect actual frequencies accurately? (2) what impacts do frequency measures have on discovered knowledge? (3) are discovered results accurate and reliable? and (4) which measures are appropriate for reflecting frequencies accurately? In this paper, taking three major factors (anti-monotonicity, maximum-frequency and window-width restriction) into account, we identify inaccuracies inherent in seven existing frequency measures, and investigate their impacts on the soundness and completeness of two kinds of knowledge, frequent episodes and episode rules, discovered from single sequences. In order to obtain more accurate frequencies and knowledge, we provide three recommendations for defining appropriate frequency measures. Following the recommendations, we introduce a more appropriate frequency measure. Empirical evaluation reveals the inaccuracies and verifies our findings.
{"title":"A Study on the Accuracy of Frequency Measures and Its Impact on Knowledge Discovery in Single Sequences","authors":"M. Gan, H. Dai","doi":"10.1109/ICDMW.2010.83","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.83","url":null,"abstract":"In knowledge discovery in single sequences, different results could be discovered from the same sequence when different frequency measures are adopted. It is natural to raise such questions as (1) do these frequency measures reflect actual frequencies accurately? (2) what impacts do frequency measures have on discovered knowledge? (3) are discovered results accurate and reliable? and (4) which measures are appropriate for reflecting frequencies accurately? In this paper, taking three major factors (anti-monotonicity, maximum-frequency and window-width restriction) into account, we identify inaccuracies inherent in seven existing frequency measures, and investigate their impacts on the soundness and completeness of two kinds of knowledge, frequent episodes and episode rules, discovered from single sequences. In order to obtain more accurate frequencies and knowledge, we provide three recommendations for defining appropriate frequency measures. Following the recommendations, we introduce a more appropriate frequency measure. Empirical evaluation reveals the inaccuracies and verifies our findings.","PeriodicalId":170201,"journal":{"name":"2010 IEEE International Conference on Data Mining Workshops","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124017604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
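To make concrete the claim that different frequency measures can yield different results on the same sequence, the sketch below counts a serial episode in two ways: by fixed-width sliding windows and by non-overlapped occurrences. These are two commonly used counting schemes chosen here for illustration; the abstract does not name the seven measures it studies, so they should not be read as the authors' definitions. The function names are hypothetical, and the window count is simplified to windows that lie fully inside the sequence.

```python
def occurs_in_order(window, episode):
    """True if the symbols of `episode` appear in `window` in order."""
    remaining = iter(window)
    # `symbol in remaining` consumes the iterator up to the first match,
    # so each later symbol must be found after the previous one.
    return all(symbol in remaining for symbol in episode)


def window_frequency(sequence, episode, width):
    """Number of sliding windows of `width` consecutive events that
    contain the serial episode (simplified: windows fully inside the
    sequence only)."""
    return sum(
        occurs_in_order(sequence[start:start + width], episode)
        for start in range(len(sequence) - width + 1)
    )


def non_overlapped_frequency(sequence, episode):
    """Greedy left-to-right count of occurrences that share no events."""
    count, position = 0, 0
    for symbol in sequence:
        if symbol == episode[position]:
            position += 1
            if position == len(episode):
                # One full occurrence completed; start looking afresh.
                count, position = count + 1, 0
    return count


if __name__ == "__main__":
    sequence, episode = "AABB", "AB"
    print(window_frequency(sequence, episode, 3))       # 2
    print(window_frequency(sequence, episode, 2))       # 1
    print(non_overlapped_frequency(sequence, episode))  # 1
```

On the sequence AABB, the episode "A then B" is counted twice with windows of width 3 but only once with width 2 or with non-overlapped counting. Any frequency threshold between those values would therefore accept or reject the same episode depending on the measure used.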
Tree Augmented Naïve Bayes (TAN) is a robust classification model. However, some researchers still attempt to improve its performance by considering the directions of edges, because the traditional learning method takes into account only the log likelihood, which is not well suited to learning classifiers, when learning the tree structure. In this paper, we analyze the search spaces of TAN and study the equivalence classes in them. Accordingly, we point out that it is not necessary to pay attention to the directions of dependencies between conditional variables, because these directions play no role in maximizing the log conditional likelihood. For application, we propose a novel framework for learning TAN classifiers. Finally, we run experiments on the Weka platform using 45 problems from the University of California at Irvine repository. Experimental results show that classification accuracy and stability do not change in a statistically significant way under our learning framework.
{"title":"Learning Robust Bayesian Network Classifiers in the Space of Markov Equivalent Classes","authors":"Zhongfeng Wang, Zhihai Wang, Bin Fu","doi":"10.1109/ICDMW.2010.91","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.91","url":null,"abstract":"Tree Augmented Naïve Bayes (TAN) is a robust classification model. However, so far some researchers still attempt to improve the performance by considering directions of edges, because traditional learning method merely takes into account log likelihood, which is not suitable for learning classifiers, when learning a tree topological structure. In this paper, we analyze search spaces of TAN, research equivalent classes in them. Accordingly, we point out it is not necessary to pay attention to the dependent directions between conditional variables for these directions do not play a role in maximizing log conditional likelihood. For application, we propose a novel framework for learning TAN classifiers. Finally, we run experiments on Weka platform using 45 problems from the University of California at Irvine repository. Experimental results show that classification accuracy and stability do not change statistically in our learning framework.","PeriodicalId":170201,"journal":{"name":"2010 IEEE International Conference on Data Mining Workshops","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124461374","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Transcriptional regulatory network identification is both a fundamental challenge in systems biology and an important practical application of data mining and machine learning. In this study, we propose a semi-supervised, learning-based integrative scoring approach to tackle this challenge and predict transcriptional regulations. Our approach outperforms a state-of-the-art label propagation method and reaches AUC scores above 0.96 in validation on three datasets from microarray experiments. A map of the transcriptional regulatory network controlling lung surfactant homeostasis was constructed. The predicted and prioritized transcriptional regulations were further validated through experimental verification. Many other predicted novel regulations may serve as candidates for future experimental investigation.
{"title":"An Integrative Scoring Approach to Identify Transcriptional Regulations Controlling Lung Surfactant Homeostasis","authors":"Minlu Zhang, C. Fang, Yan Xu, R. Bhatnagar, L. Lu","doi":"10.1109/ICDMW.2010.110","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.110","url":null,"abstract":"Transcriptional regulatory network identification is both a fundamental challenge in systems biology and an important practical application of data mining and machine learning. In this study, we propose a semi-supervised learning-based integrative scoring approach to tackle this challenge and predict transcriptional regulations. Our approach out-performs a state-of-the-art label propagation method and reaches AUC scores above 0.96 for three datasets from microarray experiments in the validation. A map of the transcriptional regulatory network controlling lung surfactant homeostasis was constructed. The predicted and prioritized transcriptional regulations were further validated through experimental verifications. Many other predicted novel regulations may serve as candidates for future experimental investigations.","PeriodicalId":170201,"journal":{"name":"2010 IEEE International Conference on Data Mining Workshops","volume":"369 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123316694","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mobile devices make it possible to track and collect trajectories of moving objects such as vehicles, people or animals. A well-known collection of patterns can occur within a subset of trajectories. Specifically, we study the so-called Popular Places, that is, regions that are visited by many distinct moving objects. We propose algorithms that efficiently compute different forms of reporting Popular Places and that take advantage of the parallelism of the Graphics Processing Unit. We also describe how to visualize the reported solutions. Finally, we present and discuss experimental results obtained with the implementation of our algorithms.
{"title":"Computing Popular Places Using Graphics Processors","authors":"Marta Fort, J. A. Sellarès, Nacho Valladares","doi":"10.1109/ICDMW.2010.45","DOIUrl":"https://doi.org/10.1109/ICDMW.2010.45","url":null,"abstract":"Mobile devices provide the availability of tracking and collecting trajectories of moving objects such as vehicles, people or animals. There exists a well-known collection of patterns which can occur for a subset of trajectories. Specifically we study the so-called Popular Places, that is regions that are visited by many distinct moving objects. We propose algorithms to efficiently compute different forms of reporting Popular Places, that take benefit of the Graphics Processing Unit parallelism capabilities. We also describe how to visualize the reported solutions. Finally we present and discuss experimental results obtained with the implementation of our algorithms.","PeriodicalId":170201,"journal":{"name":"2010 IEEE International Conference on Data Mining Workshops","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128166041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
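The notion of a Popular Place (a region visited by many distinct moving objects) can be illustrated with a small CPU-only sketch. This is not the GPU algorithm of the paper: it assumes, purely for illustration, that regions are square cells of a uniform grid, that only the sampled trajectory points count as visits (segments between samples are ignored), and that "many" means at least `min_visitors` distinct objects. All names and parameters are hypothetical.

```python
from collections import defaultdict


def popular_cells(trajectories, cell_size, min_visitors):
    """Return {cell: number of distinct visitors} for every grid cell
    visited by at least `min_visitors` distinct moving objects.

    trajectories: {object_id: [(x, y), ...]} sampled positions.
    cell_size: side length of the square grid cells (assumed regions).
    """
    visitors = defaultdict(set)
    for object_id, points in trajectories.items():
        for x, y in points:
            cell = (int(x // cell_size), int(y // cell_size))
            # A set keeps repeated visits by the same object from
            # inflating the count.
            visitors[cell].add(object_id)
    return {
        cell: len(ids)
        for cell, ids in visitors.items()
        if len(ids) >= min_visitors
    }


if __name__ == "__main__":
    trajectories = {
        "car": [(0.5, 0.5), (1.5, 0.5)],
        "bike": [(0.2, 0.8), (0.4, 0.9)],
        "dog": [(1.1, 0.1), (2.5, 2.5)],
    }
    print(popular_cells(trajectories, cell_size=1.0, min_visitors=2))
    # {(0, 0): 2, (1, 0): 2}
```

Each cell's visitor set can be built independently of the others, which hints at why this kind of counting lends itself to parallel hardware such as a GPU.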