
Latest publications from the Sixth International Conference on Data Mining (ICDM'06)

The PDD Framework for Detecting Categories of Peculiar Data
Pub Date : 2006-12-18 DOI: 10.1109/ICDM.2006.159
M. Shrestha, Howard J. Hamilton, Yiyu Yao, K. Konkel, Liqiang Geng
Peculiar data are objects that are relatively few in number and significantly different from the other objects in a data set. In this paper, we propose the PDD framework for detecting multiple categories of peculiar data. This framework provides an extensible set of perspectives for viewing data, currently including viewing data as a set of records, attributes, frequencies, intervals, sequences, or sequences of changes. By using these six views of the data, multiple categories of peculiar data can be detected to reveal different aspects of the data. For each view, the framework provides an extensible set of peculiarity measures to detect outliers and other kinds of peculiar data. The PDD framework has been implemented for Oracle and Access. Experiments are reported for data sets concerning Regina weather and NHL hockey.
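One concrete instance of a peculiarity measure in the attribute view is the classic distance-based peculiarity factor, PF(x) = Σ_y |x − y|^α over all values y of the attribute. The sketch below applies it to a toy attribute; the flagging threshold (mean + 2σ of the PF scores) and the data are illustrative assumptions, not taken from the paper.

```python
# Distance-based peculiarity factor over one attribute's values.
# The mean + 2*std flagging rule is an illustrative assumption.

def peculiarity_factor(values, alpha=0.5):
    # PF(x) = sum over all values y of |x - y|^alpha
    return [sum(abs(x - y) ** alpha for y in values) for x in values]

temps = [18.0, 19.5, 20.1, 19.0, 18.6, 41.0]   # one peculiar reading
pf = peculiarity_factor(temps)

# Flag values whose PF exceeds mean + 2 * std of the PF scores.
mean = sum(pf) / len(pf)
std = (sum((p - mean) ** 2 for p in pf) / len(pf)) ** 0.5
peculiar = [v for v, p in zip(temps, pf) if p > mean + 2 * std]
```

With a small α, distances are dampened, so a value only scores high when it is far from most other values at once; here the 41.0 reading is flagged.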
Citations: 4
Discovery of Collocation Episodes in Spatiotemporal Data
Pub Date : 2006-12-18 DOI: 10.1109/ICDM.2006.59
H. Cao, N. Mamoulis, D. Cheung
Given a collection of trajectories of moving objects of different types (e.g., pumas, deer, vultures), we introduce the problem of discovering collocation episodes in them (e.g., if a puma is moving near a deer, then a vulture is also going to move close to the same deer with high probability within the next 3 minutes). Collocation episodes capture the inter-movement regularities among different types of objects. We formally define the problem of mining collocation episodes and propose two scalable algorithms for its efficient solution. We empirically evaluate the performance of the proposed methods using synthetically generated data that emulate real-world object movements.
Citations: 64
Learning to Use a Learned Model: A Two-Stage Approach to Classification
Pub Date : 2006-12-18 DOI: 10.1109/ICDM.2006.97
M. Antonie, Osmar R Zaiane, R. Holte
Association rule-based classifiers have recently emerged as competitive classification systems. However, there are still deficiencies that hinder their performance. One deficiency is the use of rules in the classification stage. Current systems assign classes to new objects based on the best rule applied or on some predefined scoring of multiple rules. In this paper we propose a new technique where the system automatically learns how to use the rules. We achieve this by developing a two-stage classification model. First, we use association rule mining to discover classification rules. Second, we employ another learning algorithm to learn how to use these rules in the prediction process. Our two-stage approach outperforms C4.5 and RIPPER on the UCI datasets in our study, and outperforms other rule-learning methods on more than half the datasets. The versatility of our method is also demonstrated by applying it to text classification, where it equals the performance of the best known systems for this task, SVMs.
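A minimal sketch of the two-stage idea: stage 1 yields classification rules (hard-coded toy rules below, standing in for mined association rules), and stage 2 learns how to combine the rules' firings instead of applying a fixed best-rule or scoring scheme. The rules, data, and the choice of a perceptron as the stage-2 learner are all hypothetical illustrations.

```python
# Stage 1 output: each rule is (predicate over an instance dict, class it predicts).
rules = [
    (lambda x: x["outlook"] == "sunny", "no"),
    (lambda x: x["humidity"] > 80, "no"),
    (lambda x: x["outlook"] == "overcast", "yes"),
]

def rule_features(x):
    # Stage-2 input: binary vector recording which rules fire on instance x.
    return [1 if pred(x) else 0 for pred, _ in rules]

def train_weights(train, labels, epochs=20, lr=0.5):
    # Tiny perceptron over rule firings: it learns how to *use* the rules.
    w, b = [0.0] * len(rules), 0.0
    for _ in range(epochs):
        for x, y in zip(train, labels):
            f = rule_features(x)
            pred = 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0
            err = y - pred
            w = [wi + lr * err * fi for wi, fi in zip(w, f)]
            b += lr * err
    return w, b

train = [{"outlook": "sunny", "humidity": 85},
         {"outlook": "overcast", "humidity": 70},
         {"outlook": "rain", "humidity": 90},
         {"outlook": "overcast", "humidity": 85}]
labels = [0, 1, 0, 1]  # 1 = "play", 0 = "don't play"
w, b = train_weights(train, labels)
```

The learned weights can down-weight a rule that fires often but predicts poorly, which a fixed best-rule scheme cannot do.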
Citations: 46
On Trajectory Representation for Scientific Features
Pub Date : 2006-12-18 DOI: 10.1109/ICDM.2006.120
S. Mehta, S. Parthasarathy, R. Machiraju
In this article, we present trajectory representation algorithms for tangible features found in temporally varying scientific datasets. Rather than modeling the features as points, we take attributes like shape and extent of the feature into account. Our contention is that these attributes play an important role in understanding the temporal evolution and interactions among features. The proposed representation scheme is based on motion and shape parameters including linear velocity, angular velocity, etc. We use these parameters to segment the trajectory instead of relying on the geometry of the trajectory. We evaluate our algorithms on real datasets originating from different domains. We show the accuracy of the motion and shape parameter estimation by reconstructing the trajectories with high accuracy. Finally, we present performance and scalability results.
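A hedged sketch of segmenting a trajectory by motion parameters rather than raw geometry: compute per-step speed and start a new segment when the speed changes by more than a threshold. The threshold and the use of speed alone (the paper also uses angular velocity and shape parameters) are simplifying assumptions.

```python
import math

def segment_by_speed(points, dt=1.0, jump=0.5):
    # Per-step linear speed between consecutive sample points.
    speeds = [math.dist(a, b) / dt for a, b in zip(points, points[1:])]
    segments, current = [], [0]
    for i in range(1, len(speeds)):
        if abs(speeds[i] - speeds[i - 1]) > jump:
            # Speed jump: close the current segment, start a new one.
            segments.append(current)
            current = []
        current.append(i)
    segments.append(current)
    return segments  # lists of step indices with near-constant speed

# Three slow steps (speed 1) followed by two fast steps (speed 3).
pts = [(0, 0), (1, 0), (2, 0), (3, 0), (6, 0), (9, 0)]
segs = segment_by_speed(pts)
```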
Citations: 5
Opening the Black Box of Feature Extraction: Incorporating Visualization into High-Dimensional Data Mining Processes
Pub Date : 2006-12-18 DOI: 10.1109/ICDM.2006.121
Jianting Zhang, L. Gruenwald
Feature extraction techniques have been used to handle high-dimensional data, and experimental studies often show improved classification accuracies. Unfortunately, very few studies provide concrete evidence of the effectiveness of these feature extraction techniques, and they largely remain black boxes. In this study, we design and implement a visualization prototype system that allows users to look into the classification process, explore the links among the original and extracted features in different classifiers, and examine why and how an instance is correctly or incorrectly classified. We demonstrate the prototype's capabilities by combining a feature extraction method based on hierarchical feature space clustering with J48 decision tree classifiers, and perform experiments on a real hyperspectral remote sensing image dataset.
Citations: 7
Finding "Who Is Talking to Whom" in VoIP Networks via Progressive Stream Clustering
Pub Date : 2006-12-18 DOI: 10.1109/ICDM.2006.72
O. Verscheure, M. Vlachos, A. Anagnostopoulos, P. Frossard, E. Bouillet, Philip S. Yu
Technologies that use the Internet to deliver voice communications have the potential to reduce costs and improve access to communications services around the world. However, these new technologies pose several challenges in terms of confidentiality of the conversations and anonymity of the conversing parties. Call authentication and encryption techniques provide a way to protect confidentiality, while anonymity is typically preserved by an anonymizing service (anonymous call). This work studies the feasibility of revealing pairs of anonymous and encrypted conversing parties (caller/callee pairs of streams) by exploiting the vulnerabilities inherent to VoIP systems. In particular, by exploiting the aperiodic inter-departure time of VoIP packets, we can reduce each VoIP stream to a binary time series. We first define a simple yet intuitive metric to gauge the correlation between two VoIP binary streams. Then we propose an effective technique that progressively pairs conversing parties with high accuracy and in a limited amount of time. Our metric and method are justified analytically and validated by experiments on a very large standard corpus of conversational speech. We obtain impressively high pairing accuracy, reaching 97% after 5 minutes of voice conversation.
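The pipeline's first two steps can be sketched as follows: binarize each stream's packet inter-departure times into an activity series, then score candidate pairs with a similarity metric. The thresholding rule and the simple agreement metric below are illustrative assumptions, not the paper's exact metric.

```python
# Sketch of pairing VoIP streams via binarized activity series.
# Threshold and similarity metric are illustrative assumptions.

def binarize(inter_departure_times, silence_threshold=0.1):
    # 1 = packet quickly followed by another (talk activity), 0 = silence gap.
    return [1 if t < silence_threshold else 0 for t in inter_departure_times]

def similarity(a, b):
    # Fraction of time slots where the two binary streams agree.
    assert len(a) == len(b)
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Toy inter-departure times (seconds); caller and callee share the
# conversation's activity pattern, the unrelated stream does not.
caller = binarize([0.02, 0.02, 0.5, 0.02, 0.4, 0.02])
callee = binarize([0.03, 0.02, 0.6, 0.02, 0.5, 0.03])
other  = binarize([0.5, 0.02, 0.02, 0.6, 0.02, 0.5])
```

In a progressive scheme, low-similarity candidate pairs would be pruned as more slots of the streams arrive.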
Citations: 23
Fast Random Walk with Restart and Its Applications
Pub Date : 2006-12-18 DOI: 10.1109/ICDM.2006.70
Hanghang Tong, C. Faloutsos, Jia-Yu Pan
How closely related are two nodes in a graph? How can this score be computed quickly on huge, disk-resident, real graphs? Random walk with restart (RWR) provides a good relevance score between two nodes in a weighted graph, and it has been successfully used in numerous settings, like automatic captioning of images, generalizations to the "connection subgraphs", personalized PageRank, and many more. However, the straightforward implementations of RWR do not scale for large graphs, requiring either quadratic space and cubic pre-computation time, or slow response time on queries. We propose fast solutions to this problem. The heart of our approach is to exploit two important properties shared by many real graphs: (a) linear correlations and (b) block-wise, community-like structure. We exploit the linearity by using low-rank matrix approximation, and the community structure by graph partitioning, followed by the Sherman-Morrison lemma for matrix inversion. Experimental results on the Corel image and DBLP datasets demonstrate that our proposed methods achieve significant savings over the straightforward implementations: they can save several orders of magnitude in pre-computation and storage cost, and they achieve up to 150x speed up with 90%+ quality preservation.
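The relevance score being accelerated is the standard RWR fixed point r = (1 − c)·W·r + c·e, with W the column-normalized adjacency matrix and e the indicator of the seed node. Below is a minimal power-iteration sketch (the straightforward baseline, not the paper's fast method); the toy graph and restart probability are hypothetical.

```python
def rwr_scores(adj, seed, restart=0.15, iters=100, tol=1e-10):
    """RWR relevance of every node w.r.t. `seed`, by power iteration.

    adj: adjacency matrix as a list of lists (symmetric weights).
    """
    n = len(adj)
    # Column-normalize the adjacency matrix so each column sums to 1.
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    W = [[adj[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
         for i in range(n)]
    r = [1.0 / n] * n
    e = [1.0 if i == seed else 0.0 for i in range(n)]
    for _ in range(iters):
        r_new = [(1 - restart) * sum(W[i][j] * r[j] for j in range(n))
                 + restart * e[i] for i in range(n)]
        if max(abs(a - b) for a, b in zip(r_new, r)) < tol:
            return r_new
        r = r_new
    return r

# Two triangles joined by one bridge edge: nodes in the seed's community
# should score higher than nodes across the bridge.
adj = [[0, 1, 1, 0, 0, 0],
       [1, 0, 1, 0, 0, 0],
       [1, 1, 0, 1, 0, 0],
       [0, 0, 1, 0, 1, 1],
       [0, 0, 0, 1, 0, 1],
       [0, 0, 0, 1, 1, 0]]
scores = rwr_scores(adj, seed=0)
```

This naive iteration touches every edge per step; the paper's contribution is precisely avoiding this cost via low-rank approximation and graph partitioning.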
Citations: 1118
DSTree: A Tree Structure for the Mining of Frequent Sets from Data Streams
Pub Date : 2006-12-18 DOI: 10.1109/ICDM.2006.62
C. Leung, Quamrul I. Khan
With advances in technology, a flood of data can be produced in many applications such as sensor networks and Web click streams. This calls for efficient techniques for extracting useful information from streams of data. In this paper, we propose a novel tree structure, called DSTree (Data Stream Tree), that captures important data from the streams. By exploiting its nice properties, the DSTree can be easily maintained and mined for frequent itemsets as well as various other patterns like constrained itemsets.
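A minimal sketch of a DSTree-like prefix tree over a sliding window of batches: transactions are inserted in a fixed canonical item order, each node keeps one count per batch in the window, and sliding the window subtracts the oldest batch. The node layout and the restriction of `support` to canonical-order prefixes are simplifying assumptions, not the paper's exact structure.

```python
class Node:
    def __init__(self):
        self.children = {}   # item -> Node
        self.counts = []     # one count per batch currently in the window

class DSTree:
    def __init__(self, window_size):
        self.root = Node()
        self.window_size = window_size
        self.n_batches = 0

    def add_batch(self, transactions):
        self._extend(self.root)          # new batch column in existing nodes
        self.n_batches += 1
        for t in transactions:
            node = self.root
            for item in sorted(t):       # canonical item order
                child = node.children.setdefault(item, Node())
                if len(child.counts) < self.n_batches:
                    child.counts += [0] * (self.n_batches - len(child.counts))
                child.counts[-1] += 1
                node = child
        if self.n_batches > self.window_size:
            self._drop_oldest(self.root)  # slide the window
            self.n_batches -= 1

    def _extend(self, node):
        for c in node.children.values():
            c.counts.append(0)
            self._extend(c)

    def _drop_oldest(self, node):
        for c in node.children.values():
            c.counts.pop(0)
            self._drop_oldest(c)

    def support(self, itemset):
        # Window support of an itemset that forms a canonical-order prefix path.
        node = self.root
        for item in sorted(itemset):
            if item not in node.children:
                return 0
            node = node.children[item]
        return sum(node.counts)

tree = DSTree(window_size=2)
tree.add_batch([{"a", "b"}, {"a"}])
tree.add_batch([{"a", "b", "c"}])   # support of {a} over the window is now 3
```

A full miner would additionally traverse the tree FP-growth-style to enumerate all frequent itemsets, not just prefix paths.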
Citations: 198
Star-Structured High-Order Heterogeneous Data Co-clustering Based on Consistent Information Theory
Pub Date : 2006-12-18 DOI: 10.1109/ICDM.2006.154
Bin Gao, Tie-Yan Liu, Wei-Ying Ma
Heterogeneous object co-clustering has become an important research topic in data mining. In the early years of this research, work focused mainly on two types of heterogeneous data (pair-wise co-clustering), while more recently increasing attention has been paid to multiple types of heterogeneous data (high-order co-clustering). In this paper, we study the high-order co-clustering of objects with a star-structured interrelationship, i.e., there is a central type of objects that connects the other types of objects. This case is a very good model for many real-world applications, such as the co-clustering of Web images, their low-level visual features, and the surrounding text. We use a tripartite graph to represent the interrelationships among different objects, and propose a consistent information theory that yields an effective algorithm for obtaining the co-clusters of the different types of objects. Experiments on Web image data show that our proposed algorithm is a better choice compared with previous work on heterogeneous object co-clustering.
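The star-structured (tripartite) representation can be pictured as a central object type linked to each peripheral type by a weighted bipartite edge set. The toy data below (images as the central type, visual features and surrounding terms as peripherals) and the edge-weight scheme are illustrative assumptions.

```python
# Edges from the central type (images) to each peripheral type, with weights.
image_feature = {("img1", "color:red"): 0.9, ("img2", "color:blue"): 0.8}
image_term = {("img1", "apple"): 2, ("img2", "sky"): 3, ("img2", "sea"): 1}

def star_neighbors(image):
    """All peripheral objects connected to a central object."""
    feats = [f for (img, f) in image_feature if img == image]
    terms = [t for (img, t) in image_term if img == image]
    return feats, terms
```

A co-clustering algorithm over this structure would partition all three node sets jointly, using the central type to keep the two peripheral clusterings consistent.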
Citations: 43
Discovering Unrevealed Properties of Probability Estimation Trees: On Algorithm Selection and Performance Explanation
Pub Date : 2006-12-18 DOI: 10.1109/ICDM.2006.58
Kun Zhang, W. Fan, B. Buckles, Xiao-Xia Yuan, Zujia Xu
There has been increasing interest in designing better probability estimation trees, or PETs, for ranking and probability estimation. Capable of generating class membership probabilities, PETs have been shown to be highly accurate and flexible for many difficult problems, such as cost-sensitive learning and matching skewed distributions. There are a large number of PET algorithms available, and about ten of them are well-known. This large number provides an advantage, but it also creates confusion in practice. One would ask: given a new dataset, which algorithm should be chosen, and what performance can and cannot be expected? What explains good or bad performance under different situations? In this paper, we systematically, for the first time, answer these important questions by conducting a large-scale empirical comparison of five popular PETs, examining their AUC, MSE, and error rate "learning curves" (instead of training-test split based cross-validation). Using the maximum AUC achieved by any of the evaluated probability estimation tree algorithms, we demonstrate that the preference of a probability estimation tree on different evaluation metrics can be accurately characterized by the "signal-noise separability" of the dataset, as well as some other observable statistics of the dataset explained further in the paper. Moreover, in order to understand their relative performance, many important and previously unrevealed properties of each PET's mechanism and heuristics are analyzed and evaluated. Importantly, a practical guide for choosing the most appropriate PET algorithm given a new data mining problem is provided.
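What makes a decision tree a probability estimation tree is the class-membership probability it emits at each leaf, which is usually smoothed rather than taken as a raw frequency. Shown below is the standard Laplace correction p = (k + 1) / (n + C), with k the class count, n the leaf size, and C the number of classes; the toy leaf counts are hypothetical.

```python
def leaf_probabilities(class_counts, laplace=True):
    """Class probabilities at a PET leaf, optionally Laplace-smoothed."""
    n = sum(class_counts)
    c = len(class_counts)
    if laplace:
        return [(k + 1) / (n + c) for k in class_counts]
    return [k / n for k in class_counts]

# A small pure leaf: raw frequencies give an overconfident (1.0, 0.0),
# while Laplace smoothing tempers the estimate toward uniform.
raw = leaf_probabilities([3, 0], laplace=False)
smoothed = leaf_probabilities([3, 0])
```

Smoothing like this matters most for ranking metrics such as AUC, since it breaks ties among small leaves that raw frequencies would score identically.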
{"title":"Discovering Unrevealed Properties of Probability Estimation Trees: On Algorithm Selection and Performance Explanation","authors":"Kun Zhang, W. Fan, B. Buckles, Xiao-Xia Yuan, Zujia Xu","doi":"10.1109/ICDM.2006.58","DOIUrl":"https://doi.org/10.1109/ICDM.2006.58","url":null,"abstract":"There has been increasing interest to design better probability estimation trees, or PETs, for ranking and probability estimation. Capable of generating class membership probabilities, PETs have been shown to be highly accurate and flexible for many difficult problems, such as cost-sensitive learning and matching skewed distributions. There are a large number of PET algorithms available, and about ten of them are well- known. This large number provides an advantage, but it also creates confusion in practice. One would ask \"given a new dataset, which algorithm to choose and what performance to expect and not to expect? What are the reasons to explain either good or bad performance under different situations? \" In this paper, we systematically, for the first time, answer these important questions by conducting a large-scale empirical comparison of five popular PETs by examining their AUC, MSE and error rate \"learning curves\" (instead of training-test split based cross-validation). Using the maximum AUC achieved by any of the evaluated probability estimation tree algorithms, we demonstrate that the preference of a probability estimation tree on different evaluation metrics can be accurately characterized by the \"signal-noise separability\" of the dataset, as well as some other observable statistics of the dataset explained further in the paper. Moreover, in order to understand their relative performance, many important and previously unrevealed properties of each PET's mechanism and heuristics are analyzed and evaluated. 
Importantly, a practical guide for choosing the most appropriate PET algorithm given a new data mining problem is provided.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121584678","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
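The abstract above compares PETs by tracing AUC, MSE, and error-rate "learning curves" over growing training sets. As an illustration only (not the paper's exact protocol), the sketch below fits a one-split probability estimation "stump" whose leaves return Laplace-corrected class frequencies — a common PET smoothing heuristic — and traces AUC and Brier-score (MSE) learning curves on a fixed test set. The synthetic data generator, the fixed threshold, and all function names here are assumptions chosen for the example.

```python
import random

def laplace_stump(X, y, threshold):
    """Fit a one-split 'probability estimation tree': each leaf predicts the
    Laplace-corrected positive-class frequency (a common PET heuristic)."""
    left = [yi for xi, yi in zip(X, y) if xi <= threshold]
    right = [yi for xi, yi in zip(X, y) if xi > threshold]
    def leaf_prob(labels):
        # Laplace correction keeps probabilities away from 0 and 1,
        # even for small or empty leaves.
        return (sum(labels) + 1) / (len(labels) + 2)
    p_left, p_right = leaf_prob(left), leaf_prob(right)
    return lambda x: p_left if x <= threshold else p_right

def auc(scores, labels):
    """Rank-based AUC: probability that a random positive outranks a
    random negative, counting ties as half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mse(scores, labels):
    """Brier score: mean squared error of the predicted probabilities."""
    return sum((s - l) ** 2 for s, l in zip(scores, labels)) / len(labels)

random.seed(0)

def sample(n):
    """Synthetic 1-D data: positives centred at +1, negatives at -1 (the
    'signal'), with unit Gaussian noise governing signal-noise separability."""
    xs, ys = [], []
    for _ in range(n):
        y = random.randint(0, 1)
        xs.append((1.0 if y else -1.0) + random.gauss(0.0, 1.0))
        ys.append(y)
    return xs, ys

X_test, y_test = sample(500)
for n_train in (10, 40, 160, 640):  # learning curve over training-set size
    X_tr, y_tr = sample(n_train)
    model = laplace_stump(X_tr, y_tr, threshold=0.0)
    scores = [model(x) for x in X_test]
    print(f"n={n_train:4d}  AUC={auc(scores, y_test):.3f}  "
          f"MSE={mse(scores, y_test):.3f}")
```

Plotting AUC and MSE against `n_train` gives the kind of metric-specific learning curve the comparison relies on; a real study would substitute the actual PET algorithms and datasets for this toy stump.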
Citations: 4
Journal: Sixth International Conference on Data Mining (ICDM'06)