Wei Shen, Y. Kamarianakis, L. Wynter, Jingrui He, Qing He, Richard D. Lawrence, G. Swirszcz
This report summarizes the methodologies and techniques we developed and applied to tackle Task 3 of the IEEE ICDM Contest, which concerns predicting traffic velocity from GPS data. The major components of our solution are (1) a pre-processing procedure that maps GPS readings to the road network, (2) a K-nearest-neighbor approach that identifies the most similar training hours for every test hour, and (3) a heuristic evaluation framework for optimizing parameters and avoiding over-fitting. Our solution finished second in the final evaluation.
Wei Shen, Y. Kamarianakis, L. Wynter, Jingrui He, Qing He, Richard D. Lawrence, and G. Swirszcz, "Traffic Velocity Prediction Using GPS Data: IEEE ICDM Contest Task 3 Report," 2010 IEEE International Conference on Data Mining Workshops, Dec. 2010. doi:10.1109/ICDMW.2010.52
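The KNN step can be illustrated with a minimal sketch. The feature representation below (per-segment mean velocities for each hour) and the Euclidean metric are illustrative assumptions, not details from the report:

```python
import numpy as np

def k_nearest_hours(train_hours, test_hour, k=3):
    """Return indices of the k training hours whose feature vectors
    are closest (Euclidean distance) to the test hour's vector.

    train_hours: (n_hours, n_features) array, e.g. mean velocity per
    road segment in each hour; test_hour: (n_features,) array.
    """
    dists = np.linalg.norm(train_hours - test_hour, axis=1)
    return np.argsort(dists)[:k]

# Toy example: 4 training hours described by 3 segment velocities.
train = np.array([[50.0, 48.0, 52.0],
                  [20.0, 25.0, 22.0],   # congested hour
                  [49.0, 50.0, 51.0],
                  [35.0, 30.0, 33.0]])
test = np.array([21.0, 24.0, 23.0])     # resembles the congested hour
print(k_nearest_hours(train, test, k=2))  # → [1 3]
```

Predictions for the test hour would then be formed from the velocities observed in the retrieved hours.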
In knowledge discovery in single sequences, different results can be discovered from the same sequence when different frequency measures are adopted. This naturally raises several questions: (1) do these frequency measures reflect actual frequencies accurately? (2) what impact do frequency measures have on the discovered knowledge? (3) are the discovered results accurate and reliable? and (4) which measures reflect frequencies accurately? In this paper, taking three major factors (anti-monotonicity, maximum frequency, and window-width restriction) into account, we identify inaccuracies inherent in seven existing frequency measures and investigate their impact on the soundness and completeness of two kinds of knowledge discovered from single sequences: frequent episodes and episode rules. To obtain more accurate frequencies and knowledge, we provide three recommendations for defining appropriate frequency measures and, following them, introduce a more appropriate frequency measure. Empirical evaluation reveals the inaccuracies and verifies our findings.
M. Gan and H. Dai, "A Study on the Accuracy of Frequency Measures and Its Impact on Knowledge Discovery in Single Sequences," 2010 IEEE International Conference on Data Mining Workshops, Dec. 2010. doi:10.1109/ICDMW.2010.83
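For concreteness, one classic window-based frequency measure for serial episodes (count the sliding windows that contain the episode as an ordered subsequence) can be sketched as follows; this is an existing measure of the kind the paper analyzes, not the new measure it introduces:

```python
def window_frequency(sequence, episode, width):
    """Count the sliding windows of the given width that contain the
    episode as an (ordered) subsequence -- a classic window-based
    frequency measure for serial episodes in a single sequence.
    """
    def contains(window, ep):
        it = iter(window)
        return all(e in it for e in ep)   # ordered-subsequence test
    n = len(sequence)
    return sum(contains(sequence[i:i + width], episode)
               for i in range(n - width + 1))

seq = list("abcabcab")
print(window_frequency(seq, ["a", "b"], 3))  # → 4
print(window_frequency(seq, ["a"], 3))       # → 6
```

Note that on this example the measure is anti-monotone: the sub-episode ["a"] is at least as frequent as ["a", "b"], which is one of the three factors the paper examines.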
Barnan Das, Chao Chen, N. Dasgupta, D. Cook, Adriana M. Seelye
With more older adults and people with cognitive disorders preferring to live independently at home, prompting systems that assist with Activities of Daily Living (ADLs) are in demand. In this paper, introducing “The PUCK”, we present a first approach to automating a prompting system without any predefined rule set or user feedback. We statistically analyze realistic prompting data and derive a classifier from statistical outlier detection methods. Further, we devise a sampling technique to handle skewed and sparse data sets. We empirically find a class distribution suitable for our task and validate our claims using three classical machine learning algorithms.
Barnan Das, Chao Chen, N. Dasgupta, D. Cook, and Adriana M. Seelye, "Automated Prompting in a Smart Home Environment," 2010 IEEE International Conference on Data Mining Workshops, Dec. 2010. doi:10.1109/ICDMW.2010.147
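A minimal stand-in for an outlier-detection-based classifier of the kind described: fit the majority ("no prompt") class and flag strong deviations as prompt candidates. The features and threshold below are hypothetical, not from the paper:

```python
import numpy as np

def zscore_outlier_classifier(majority, threshold=3.0):
    """Fit mean/std on the majority ('no prompt') class and flag any
    activity step whose features deviate strongly as an outlier
    ('prompt' candidate).  Threshold of 3 standard deviations is an
    illustrative choice.
    """
    mu = majority.mean(axis=0)
    sigma = majority.std(axis=0) + 1e-9        # avoid divide-by-zero
    def predict(x):
        z = np.abs((x - mu) / sigma)
        return int(z.max() > threshold)        # 1 = prompt candidate
    return predict

# Hypothetical 'normal activity' feature vectors (duration, steps)
normal = np.random.default_rng(0).normal(10.0, 1.0, size=(200, 2))
clf = zscore_outlier_classifier(normal)
print(clf(np.array([10.2, 9.8])), clf(np.array([30.0, 10.0])))  # → 0 1
```

This one-class framing avoids needing many labeled "prompt" examples, which is exactly the class-imbalance problem the paper's sampling technique targets.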
Jin Chang, Jun Luo, J. Huang, Shengzhong Feng, Jianping Fan
Rapid growth of data provides us with more information, yet challenges traditional techniques for extracting useful knowledge. In this paper, we propose MCMM, a Minimum spanning tree (MST) based Classification model for Massive data with a MapReduce implementation. It can be viewed as an intermediate model between the traditional K-nearest-neighbor method and cluster-based classification, aiming to overcome their disadvantages while coping with large amounts of data. Our model is implemented on the Hadoop platform, using its MapReduce programming framework, which is particularly well suited to cloud computing. We have run experiments on several data sets, including real-world data from the UCI repository and synthetic data, on a Downing 4000 cluster running Hadoop. The results show that our model generally outperforms KNN and some other classification methods with respect to accuracy and scalability.
Jin Chang, Jun Luo, J. Huang, Shengzhong Feng, and Jianping Fan, "Minimum Spanning Tree Based Classification Model for Massive Data with MapReduce Implementation," 2010 IEEE International Conference on Data Mining Workshops, Dec. 2010. doi:10.1109/ICDMW.2010.14
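The MST-plus-clustering idea underlying such a model can be sketched on a single machine. This is not the paper's exact MCMM algorithm, and the MapReduce partitioning is omitted; it shows only the standard trick of cutting the heaviest MST edges to form clusters:

```python
import numpy as np

def mst_clusters(points, n_clusters):
    """Build Kruskal's MST over pairwise Euclidean distances, then
    drop the n_clusters-1 heaviest MST edges to split the data into
    clusters (each point's label is its component root)."""
    n = len(points)
    edges = sorted((np.linalg.norm(points[i] - points[j]), i, j)
                   for i in range(n) for j in range(i + 1, n))
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    mst = []
    for w, i, j in edges:                   # Kruskal: grow the MST
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            mst.append((w, i, j))
    parent = list(range(n))                 # re-union, cutting the
    for w, i, j in sorted(mst)[: n - n_clusters]:   # heaviest edges
        parent[find(i)] = find(j)
    return [find(i) for i in range(n)]

pts = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11]])
labels = mst_clusters(pts, 2)
print(labels)  # first three points share a label, last two another
```

Classification would then assign a test point to the nearest cluster (e.g., by centroid), which is what places the method between KNN and cluster-based classification.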
Diagnosis Related Group (DRG) upcoding is an anomaly in healthcare data that costs hundreds of millions of dollars in many developed countries. DRG upcoding is typically detected through resource-intensive auditing. As supervised modeling of DRG upcoding is severely constrained by the scope and timeliness of past audit data, we propose an unsupervised algorithm to filter data for potential instances of DRG upcoding. The algorithm has been applied to a hip replacement/revision dataset and a heart-attack dataset; the results are consistent with the assumptions held by domain experts.
Wei Luo and M. Gallagher, "Unsupervised DRG Upcoding Detection in Healthcare Databases," 2010 IEEE International Conference on Data Mining Workshops, Dec. 2010. doi:10.1109/ICDMW.2010.108
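The paper's algorithm is not reproduced here, but one plausible unsupervised filter in this spirit flags, within each DRG, episodes whose resource use (here, length of stay) is unusually low for that group, since upcoding pairs a high-paying code with resource use typical of a cheaper one. The IQR rule and threshold are illustrative assumptions:

```python
import numpy as np

def drg_outlier_filter(drg_codes, lengths_of_stay, k=1.5):
    """Flag episodes whose length of stay is unusually LOW for their
    DRG, using a per-group interquartile-range rule (a hypothetical
    proxy for upcoding, for illustration only)."""
    flags = np.zeros(len(drg_codes), dtype=bool)
    for code in set(drg_codes):
        idx = [i for i, c in enumerate(drg_codes) if c == code]
        los = np.array([lengths_of_stay[i] for i in idx])
        q1, q3 = np.percentile(los, [25, 75])
        low = q1 - k * (q3 - q1)            # lower Tukey fence
        for i in idx:
            flags[i] = lengths_of_stay[i] < low
    return flags

codes = ["A"] * 7 + ["B"] * 3
los = [8, 9, 7, 8, 10, 9, 1, 3, 4, 3]   # the 1-day "A" stay is suspect
print(drg_outlier_filter(codes, los))
```

Flagged episodes would then be passed to auditors, narrowing the resource-intensive manual review the abstract describes.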
Item folksonomy, or tag information, is a typical and prevalent kind of Web 2.0 information. Item folksonomy contains rich information about users' opinions on item classifications and descriptions, and it can serve as another important information source for opinion mining. On the other hand, each item is associated with taxonomy information that reflects the viewpoints of experts. In this paper, we propose to mine users' opinions on items based on item taxonomy developed by experts and folksonomy contributed by users. In addition, we explore how to make personalized item recommendations based on users' opinions. Experiments conducted on real-world datasets collected from Amazon.com and CiteULike demonstrate the effectiveness of the proposed approaches.
Huizhi Liang, Yue Xu, and Yuefeng Li, "Mining Users' Opinions Based on Item Folksonomy and Taxonomy for Personalized Recommender Systems," 2010 IEEE International Conference on Data Mining Workshops, Dec. 2010. doi:10.1109/ICDMW.2010.163
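A building block such recommenders typically need is user similarity over folksonomy profiles. A minimal sketch, with hypothetical tag-count profiles (the paper's actual opinion-mining model is richer than this):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse tag-count dicts."""
    shared = set(u) & set(v)
    dot = sum(u[t] * v[t] for t in shared)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical user profiles built from folksonomy tags
alice = {"jazz": 3, "vinyl": 1}
bob   = {"jazz": 2, "vinyl": 2}
carol = {"ml": 4, "python": 1}
print(round(cosine(alice, bob), 3), cosine(alice, carol))  # → 0.894 0.0
```

Items favored by highly similar users (here, Bob for Alice) become recommendation candidates; the taxonomy side would add expert-defined categories to these profiles.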
During the early phase of drug discovery, machine learning methods are often used to select compounds to send for experimental screening. For this goal, any method that can estimate the error rate of a given set of predictions is an extremely valuable tool. In this paper we compare the Platt calibration algorithm and the recently introduced conformal prediction algorithm for controlling the error rate, in the sense of precision, while preserving the ability to identify as many compounds as possible (recall) that are highly likely to be bio-active in a certain context. We empirically evaluate and compare the performance of Platt calibration and the offline Mondrian inductive confidence machine (ICM) for SVM-based classification on 75 distinct classification problems. We perform this evaluation in the real-world setting where the true class labels of compounds are unknown at prediction time and are revealed only after the biological experiment is completed. Our empirical results show that, under this setting, neither the offline Mondrian ICM nor Platt calibration can bound precision rates well on an absolute basis. Comparatively, the Mondrian ICM, even though not theoretically designed to control precision directly, compares favorably with Platt calibration for this task.
Nikil Wale, "An Empirical Comparison of Platt Calibration and Inductive Confidence Machines for Predictions in Drug Discovery," 2010 IEEE International Conference on Data Mining Workshops, Dec. 2010. doi:10.1109/ICDMW.2010.111
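Platt calibration itself is easy to sketch: fit a sigmoid mapping classifier decision scores to probabilities. The version below uses plain gradient descent on the log-loss; Platt's original procedure uses Newton-style optimization and regularized targets, which are omitted here:

```python
import math

def platt_fit(scores, labels, lr=0.01, steps=5000):
    """Fit Platt's sigmoid P(y=1|s) = 1 / (1 + exp(A*s + B)) to
    decision scores by gradient descent on the log-loss."""
    A, B = 0.0, 0.0
    for _ in range(steps):
        gA = gB = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(A * s + B))
            gA += (p - y) * (-s)   # d(logloss)/dA
            gB += (p - y) * (-1)   # d(logloss)/dB
        A -= lr * gA
        B -= lr * gB
    return lambda s: 1.0 / (1.0 + math.exp(A * s + B))

# Separable toy scores: positives > 0, negatives < 0
prob = platt_fit([-2.0, -1.0, 1.0, 2.0], [0, 0, 1, 1])
print(prob(2.0) > 0.5, prob(-2.0) < 0.5)  # → True True
```

In an ICM, by contrast, held-out calibration scores yield per-prediction p-values rather than a fitted sigmoid, which is the design difference the comparison above probes.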
We propose a new approach to analyzing the Japanese government bond (JGB) market using text-mining technology. First, we extract feature vectors from the monthly reports of the Bank of Japan (BOJ). Then, trends in the JGB market are estimated by regression analysis on the feature vectors. In comparison with support vector regression and other methods, the proposed method forecast both the level and the direction of long-term market trends with higher accuracy. Moreover, our method achieved high average annual returns in the trading test.
K. Izumi, Takashi Goto, and Tohgoroh Matsui, "Trading Tests of Long-Term Market Forecast by Text Mining," 2010 IEEE International Conference on Data Mining Workshops, Dec. 2010. doi:10.1109/ICDMW.2010.60
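The extract-features-then-regress pipeline can be illustrated with a toy bag-of-words regression. The documents and yield changes below are invented, and the paper's actual BOJ-report feature extraction is not reproduced:

```python
import numpy as np

def bow_regression(docs, y):
    """Least-squares regression from bag-of-words counts (plus an
    intercept) to a market variable."""
    vocab = sorted({w for d in docs for w in d.split()})
    X = np.array([[d.split().count(w) for w in vocab] for d in docs],
                 dtype=float)
    X = np.hstack([X, np.ones((len(docs), 1))])      # intercept column
    coef, *_ = np.linalg.lstsq(X, np.array(y, float), rcond=None)
    def predict(doc):
        x = [doc.split().count(w) for w in vocab] + [1.0]
        return float(np.dot(x, coef))
    return predict

reports = ["economy weak deflation", "economy strong growth",
           "weak weak deflation", "strong growth growth"]
rates = [-0.2, 0.3, -0.4, 0.5]           # hypothetical yield changes
model = bow_regression(reports, rates)
print(round(model("economy strong growth"), 2))  # → 0.3
```

A trading test then converts such forecasts into positions (e.g., long when the predicted change is positive) and measures the realized return, which is how the abstract's "implementation test" is framed.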
Continuously increasing amounts of data in data warehouses provide companies with ample opportunity to conduct analytical customer relationship management (CRM). However, how to use the information retrieved from these data to retain the most valuable customers, identify customers with additional revenue potential, and achieve cost-effective customer relationship management continues to pose challenges for companies. This study proposes a two-level approach combining SOM-Ward clustering and predictive analytics to segment the customer base of a case company with 1.5 million customers. First, according to the customers' spending amount and demographic and behavioral characteristics, we adopt SOM-Ward clustering to divide the customer base into seven segments: exclusive customers, high-spending customers, and five segments of mass customers. Then, three classification models - a support vector machine (SVM), a neural network, and a decision tree - are employed to classify high-spending and low-spending customers; their performance is evaluated and compared. Finally, the three models are combined to predict potential high-spending customers among the mass customers. We find that this hybrid approach provides more thorough and detailed information about the customer base, especially the untapped mass market with potentially high revenue contribution, for tailoring actionable marketing strategies.
Zhiyuan Yao, T. Eklund, and B. Back, "Using SOM-Ward Clustering and Predictive Analytics for Conducting Customer Segmentation," 2010 IEEE International Conference on Data Mining Workshops, Dec. 2010. doi:10.1109/ICDMW.2010.121
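The final combination step can be as simple as a majority vote over the three trained models. The stand-in classifiers below are placeholders (any callable returning 0/1), not the paper's fitted SVM, neural network, and decision tree:

```python
def combined_vote(classifiers, x):
    """Majority vote of several trained binary classifiers: flag a
    customer as a potential high-spender when a strict majority of
    models say so."""
    votes = sum(clf(x) for clf in classifiers)
    return int(votes * 2 > len(classifiers))

# Hypothetical stand-ins for the three trained models
svm_like  = lambda x: int(x["annual_spend"] > 1000)
net_like  = lambda x: int(x["visits_per_month"] > 4)
tree_like = lambda x: int(x["annual_spend"] > 800 and x["visits_per_month"] > 2)

customer = {"annual_spend": 900, "visits_per_month": 6}
print(combined_vote([svm_like, net_like, tree_like], customer))  # → 1
```

Voting trades each model's individual bias for agreement among the three, which is one common rationale for combining heterogeneous classifiers in this way.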
Clique detection and analysis is one of the fundamental problems in graph theory. However, as graphs grow in size (e.g., those of social networks), such analysis becomes difficult for existing sequential algorithms due to computation and memory limitations. In this paper, we present a distributed algorithm, dMaximalCliques, which can obtain clique information from million-node graphs within a few minutes on an 80-node computer cluster. dMaximalCliques is a distributed algorithm for shared-nothing systems, such as racks of clusters. We use very large real and synthetic graphs in the experimental studies to demonstrate the efficiency of the algorithm. In addition, we propose the distribution of maximal clique sizes in a graph (the maximal clique distribution) as a new measure of a graph's structural properties and a means of distinguishing different types of graphs. We also find that this distribution is well fitted by a lognormal distribution.
Li Lu, Yunhong Gu, and R. Grossman, "dMaximalCliques: A Distributed Algorithm for Enumerating All Maximal Cliques and Maximal Clique Distribution," 2010 IEEE International Conference on Data Mining Workshops, Dec. 2010. doi:10.1109/ICDMW.2010.13
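As a reference point for what each worker must compute, the classic sequential Bron-Kerbosch enumeration of maximal cliques (without the paper's distributed partitioning) is:

```python
def bron_kerbosch(R, P, X, adj, out):
    """Enumerate all maximal cliques of a graph given as an adjacency
    dict of sets.  R: current clique; P: candidates that extend R;
    X: vertices already processed (prevents non-maximal output)."""
    if not P and not X:
        out.append(sorted(R))   # R cannot be extended: maximal clique
        return
    for v in list(P):
        bron_kerbosch(R | {v}, P & adj[v], X & adj[v], adj, out)
        P = P - {v}
        X = X | {v}

# Triangle 1-2-3 plus a pendant edge 3-4
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
cliques = []
bron_kerbosch(set(), set(adj), set(), adj, cliques)
print(sorted(cliques))  # → [[1, 2, 3], [3, 4]]
```

The maximal clique distribution the paper proposes is then just the histogram of clique sizes, e.g. `Counter(len(c) for c in cliques)`.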