
Sixth International Conference on Data Mining (ICDM'06): Latest Publications

Probabilistic Enhanced Mapping with the Generative Tabular Model
Pub Date : 2006-12-18 DOI: 10.1109/ICDM.2006.128
R. Priam, M. Nadif
Visualization of massive datasets requires new methods that can quickly and easily reveal their contents. Projecting the data cloud is an interesting paradigm, although it becomes difficult to explore when the plotted points are too numerous. We therefore study a new way to display a bidimensional projection of a multidimensional data cloud: our generative model constructs a tabular view of the projected cloud. High-density areas are revealed by a non-equidistributed discretization. This approach is an alternative to the self-organizing map when a projection already exists. The resulting pixel views of a dataset are illustrated by projecting a sample of real images: it becomes possible to observe how the class labels, or the frequencies of a group of modalities, are laid out, without getting lost after a zoom change, for instance. The conclusion outlines perspectives for this original and promising approach to obtaining a readable projection for the statistical analysis of large data samples.
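The non-equidistributed discretization can be illustrated with a minimal sketch (this is not the authors' generative model, only the underlying idea): equal-frequency bin edges on each axis of the 2D projection give dense regions narrower cells, so a fixed-size table adapts to the shape of the cloud. The toy cloud and helper names below are invented for illustration.

```python
from bisect import bisect_right

def quantile_edges(values, bins):
    """Equal-frequency bin edges: dense regions get narrower cells."""
    s = sorted(values)
    return [s[(len(s) * k) // bins] for k in range(1, bins)]

def to_cell(point, x_edges, y_edges):
    """Map a projected 2D point to its (row, col) cell in the tabular view."""
    x, y = point
    return (bisect_right(y_edges, y), bisect_right(x_edges, x))

# A projected cloud that is dense near the origin and sparse elsewhere.
cloud = [(i / 10.0, i / 10.0) for i in range(8)] + [(5.0, 5.0), (9.0, 9.0)]
x_edges = quantile_edges([p[0] for p in cloud], 3)
y_edges = quantile_edges([p[1] for p in cloud], 3)
cells = [to_cell(p, x_edges, y_edges) for p in cloud]
```

With equal-width bins almost every point would land in the first cell; the quantile edges spread the dense region over several cells instead.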
Citations: 0
Improving Nearest Neighbor Classifier Using Tabu Search and Ensemble Distance Metrics
Pub Date : 2006-12-18 DOI: 10.1109/ICDM.2006.86
M. Tahir, Jim E. Smith
The nearest-neighbor (NN) classifier has long been used in pattern recognition, exploratory data analysis, and data mining problems. A vital consideration in obtaining good results with this technique is the choice of distance function, and correspondingly which features to consider when computing distances between samples. In this paper, a new ensemble technique is proposed to improve the performance of the NN classifier. The proposed approach combines multiple NN classifiers, where each classifier uses a different distance function and potentially a different set of features (feature vector). These feature vectors are determined for each distance metric using a simple voting scheme incorporated into Tabu Search (TS). The proposed ensemble classifier with different distance metrics and different feature vectors (TS-DF/NN) is evaluated on various benchmark data sets from the UCI Machine Learning Repository. Results indicate a significant performance increase compared with various well-known classifiers. Furthermore, the proposed ensemble method is also compared with an ensemble classifier using different distance metrics but the same feature vector (with or without feature selection (FS)).
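The voting step can be sketched as follows, assuming plain majority voting over one k-NN classifier per distance metric; the paper's Tabu Search feature selection is omitted, and the toy data, metric choices, and function names are invented for illustration.

```python
from collections import Counter

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

def knn_predict(train, labels, query, dist, k=3):
    """Plain k-NN majority vote under one distance function."""
    ranked = sorted(range(len(train)), key=lambda i: dist(train[i], query))
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

def ensemble_predict(train, labels, query, metrics, k=3):
    """Majority vote over one k-NN classifier per distance metric."""
    votes = Counter(knn_predict(train, labels, query, m, k) for m in metrics)
    return votes.most_common(1)[0][0]

train = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1), (0.2, 0.1), (1.1, 0.9)]
labels = ["a", "a", "b", "b", "a", "b"]
metrics = [euclidean, manhattan, chebyshev]
```

Each base classifier sees the same training data but ranks neighbors differently, so the ensemble can correct a single metric's mistakes near class boundaries.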
Citations: 15
Incremental Mining of Frequent Query Patterns from XML Queries for Caching
Pub Date : 2006-12-18 DOI: 10.1109/ICDM.2006.88
Guoliang Li, Jianhua Feng, Jianyong Wang, Yong Zhang, Lizhu Zhou
Existing studies on mining frequent XML query patterns mainly introduce a straightforward candidate generate-and-test strategy and periodically compute the frequencies of candidate query patterns from scratch by scanning the entire transaction database, which consists of XML query patterns transformed from user queries. However, maintaining such discovered frequent patterns in real XML databases is nontrivial, because frequent updates may not only invalidate some existing frequent query patterns but also generate new ones. Accordingly, existing proposals handle the evolution of the transaction database inefficiently. To address these problems, this paper presents IPS-FXQPMiner, an efficient algorithm for mining frequent XML query patterns without candidate maintenance or costly tree-containment checking. We transform XML queries into sequences through a one-to-one mapping and then mine the frequent sequences to generate frequent XML query patterns. More importantly, based on IPS-FXQPMiner, an efficient incremental algorithm, Incre-FXQPMiner, is proposed to incrementally mine frequent XML query patterns, minimizing the I/O and computation required to handle incremental updates. Our experimental study on various real-life datasets demonstrates the efficiency and scalability of our algorithms over previously known alternatives.
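A minimal sketch of the tree-to-sequence idea: a preorder encoding with explicit close markers is one-to-one, so frequent rooted query fragments can be counted as frequent sequence prefixes. This is only a stand-in for IPS-FXQPMiner's actual encoding and mining algorithm; the encoding, helper names, and toy queries are assumptions.

```python
from collections import Counter

def tree_to_sequence(node):
    """Preorder encoding with a '$' close marker; the mapping is one-to-one."""
    label, children = node
    seq = [label]
    for child in children:
        seq += tree_to_sequence(child)
    seq.append("$")
    return seq

def frequent_prefixes(queries, minsup):
    """Count rooted prefixes of the encoded queries; frequent prefixes
    stand in for frequently asked query-pattern fragments."""
    counts = Counter()
    for q in queries:
        s = tree_to_sequence(q)
        for i in range(1, len(s) + 1):
            counts[tuple(s[:i])] += 1
    return {p: c for p, c in counts.items() if c >= minsup}

# Three user queries: twice book/title, once book/author.
queries = [
    ("book", [("title", [])]),
    ("book", [("title", [])]),
    ("book", [("author", [])]),
]
frequent = frequent_prefixes(queries, minsup=2)
```

Because the encoding is invertible, every frequent prefix can be decoded back into a query-pattern fragment worth caching.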
Citations: 9
COALA: A Novel Approach for the Extraction of an Alternate Clustering of High Quality and High Dissimilarity
Pub Date : 2006-12-18 DOI: 10.1109/ICDM.2006.37
Eric Bae, J. Bailey
Cluster analysis has long been a fundamental task in data mining and machine learning. However, traditional clustering methods concentrate on producing a single solution, even though multiple alternative clusterings may exist. It is thus difficult for the user to validate whether the given solution is in fact appropriate, particularly for large and complex datasets. In this paper we explore the critical requirements for systematically finding a new clustering, given that an already known clustering is available, and we propose a novel algorithm, COALA, to discover this new clustering. Our approach is driven by two important factors: dissimilarity and quality. These are especially important for finding a new clustering that is highly informative about the underlying structure of the data, yet at the same time distinctively different from the provided clustering. We undertake an experimental analysis and show that our method is able to outperform existing techniques on both synthetic and real datasets.
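The two driving criteria can be made concrete with a small sketch that scores candidate clusterings by dissimilarity to the given solution minus a quality penalty. COALA itself is a different (agglomerative) algorithm; this only illustrates the trade-off, and the scoring weight, helper names, and toy data are invented.

```python
from itertools import combinations

def quality(labels, points):
    """Within-cluster sum of squared distances to centroids (lower is better)."""
    total = 0.0
    for c in set(labels):
        members = [p for p, l in zip(points, labels) if l == c]
        centroid = [sum(dim) / len(members) for dim in zip(*members)]
        total += sum(sum((x - m) ** 2 for x, m in zip(p, centroid))
                     for p in members)
    return total

def dissimilarity(labels_a, labels_b):
    """Fraction of point pairs grouped differently by the two clusterings."""
    pairs = list(combinations(range(len(labels_a)), 2))
    differ = sum((labels_a[i] == labels_a[j]) != (labels_b[i] == labels_b[j])
                 for i, j in pairs)
    return differ / len(pairs)

def best_alternative(candidates, given, points, weight):
    """Pick the candidate with the best dissimilarity/quality trade-off."""
    return max(candidates,
               key=lambda c: dissimilarity(c, given) - weight * quality(c, points))

# Four corners of a rectangle; the given clustering splits on x.
points = [(0, 0), (0, 4), (5, 0), (5, 4)]
given = [0, 0, 1, 1]
candidates = [given, [0, 1, 0, 1], [0, 1, 1, 0]]
alt = best_alternative(candidates, given, points, weight=0.01)
```

The trade-off weight is data-dependent; here it is just small enough that repeating the given clustering (dissimilarity 0) does not win, so the orthogonal y-split is selected over the low-quality diagonal split.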
Citations: 140
Applying Data Mining to Pseudo-Relevance Feedback for High Performance Text Retrieval
Pub Date : 2006-12-18 DOI: 10.1109/ICDM.2006.22
Xiangji Huang, Y. Huang, M. Wen, Aijun An, Y. Liu, Josiah Poon
In this paper, we investigate the use of data mining, in particular text classification and co-training techniques, to identify more relevant passages based on a small set of labeled passages obtained from the blind feedback of a retrieval system. The data mining results are used to expand query terms and to re-estimate some of the parameters used in a probabilistic weighting function. We evaluate the data-mining-based feedback method on the TREC HARD data set. The results show that data mining can be successfully applied to improve text retrieval performance. We report our experimental findings in detail.
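A minimal sketch of the pseudo-relevance feedback step this builds on: treat the top-ranked documents as relevant and expand the query with their most frequent unseen terms. The paper's classification and co-training machinery is omitted, and the stop list, documents, and function names are invented for illustration.

```python
from collections import Counter

def expand_query(query_terms, ranked_docs, feedback_k=2, expand_n=2):
    """Treat the top-ranked documents as relevant and append their most
    frequent unseen terms to the query."""
    stop = {"the", "a", "of", "and", "to", "in"}  # toy stop list
    counts = Counter()
    for doc in ranked_docs[:feedback_k]:
        for term in doc.lower().split():
            if term not in stop and term not in query_terms:
                counts[term] += 1
    return query_terms + [t for t, _ in counts.most_common(expand_n)]

# Documents in the order the first-pass retrieval ranked them.
docs = [
    "text retrieval with relevance feedback improves retrieval",
    "feedback terms expand the retrieval query",
    "unrelated cooking recipes",
]
expanded = expand_query(["retrieval"], docs)
```

Because feedback is "blind", any noise in the top-ranked documents leaks into the expansion, which is exactly the weakness the paper's data mining step is meant to reduce.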
Citations: 47
Solution Path for Semi-Supervised Classification with Manifold Regularization
Pub Date : 2006-12-18 DOI: 10.1109/ICDM.2006.150
G. Wang, Tao Chen, D. Yeung, F. Lochovsky
With very low extra computational cost, the entire solution path can be computed for various learning algorithms like support vector classification (SVC) and support vector regression (SVR). In this paper, we extend this promising approach to semi-supervised learning algorithms. In particular, we consider finding the solution path for the Laplacian support vector machine (LapSVM), which is a semi-supervised classification model based on manifold regularization. One advantage of this algorithm is that the coefficient path is piecewise linear with respect to the regularization parameter; hence its computational complexity is quadratic in the number of labeled examples.
Citations: 12
Dirichlet Aspect Weighting: A Generalized EM Algorithm for Integrating External Data Fields with Semantically Structured Queries by Using Gradient Projection Method
Pub Date : 2006-12-18 DOI: 10.1109/ICDM.2006.55
A. Velivelli, Thomas S. Huang
In this paper we address the problem of document retrieval with semantically structured queries, i.e., queries where each term has a tagged field label. We introduce the Dirichlet Aspect Weighting model, which integrates terms from external databases into the query language model in a Bayesian learning framework. For this model, the Dirichlet prior distribution is governed by parameters that depend on the number of fields in the external databases. The model requires additional examples to augment the semantically structured query; these examples are obtained using pseudo-relevance feedback. We formulate a log-likelihood function for the Dirichlet Aspect Weighting model and maximize it using a novel generalized EM algorithm. On the TREC 2005 Genomics Track dataset, the Dirichlet Aspect Weighting model shows an improvement over baseline methods that use pseudo-relevance feedback while incorporating terms from external databases.
Citations: 0
An Experimental Investigation of Graph Kernels on a Collaborative Recommendation Task
Pub Date : 2006-12-18 DOI: 10.1109/ICDM.2006.18
François Fouss, Luh Yen, A. Pirotte, M. Saerens
This work presents a systematic comparison between seven kernels (or similarity matrices) on a graph, namely the exponential diffusion kernel, the Laplacian diffusion kernel, the von Neumann kernel, the regularized Laplacian kernel, the commute time kernel, and finally the Markov diffusion kernel and the cross-entropy diffusion matrix - both introduced in this paper - on a collaborative recommendation task involving a database. The database is viewed as a graph where elements are represented as nodes and relations as links between nodes. From this graph, seven kernels are computed, leading to a set of meaningful proximity measures between nodes, allowing us to answer questions about the structure of the graph under investigation; in particular, to recommend items to users. Cross-validation results indicate that a simple nearest-neighbours rule based on the similarity measure provided by the regularized Laplacian, the Markov diffusion and the commute time kernels performs best. We therefore recommend the use of the commute time kernel for computing similarities between elements of a database, for two reasons: (1) it has an appealing interpretation in terms of random walks and (2) no parameter needs to be adjusted.
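The recommended commute-time kernel is straightforward to compute: it is the Moore-Penrose pseudo-inverse of the graph Laplacian, which for a connected graph can be obtained via the identity L+ = (L + J/n)^(-1) - J/n, where J is the all-ones matrix. A minimal stdlib-only sketch (the tiny path graph and helper names are invented; on a tree, the average commute time equals the graph volume times the number of edges on the path, which gives a sanity check):

```python
def mat_inverse(a):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(a)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col]:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def commute_time_kernel(adj):
    """Pseudo-inverse of the Laplacian via L+ = (L + J/n)^(-1) - J/n."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    lap = [[(deg[i] if i == j else 0.0) - adj[i][j] for j in range(n)]
           for i in range(n)]
    shifted = [[lap[i][j] + 1.0 / n for j in range(n)] for i in range(n)]
    inv = mat_inverse(shifted)
    return [[inv[i][j] - 1.0 / n for j in range(n)] for i in range(n)]

def commute_time(adj, i, j):
    """Average commute time: vol * (L+_ii + L+_jj - 2 L+_ij)."""
    k = commute_time_kernel(adj)
    vol = sum(sum(row) for row in adj)  # twice the number of edges
    return vol * (k[i][i] + k[j][j] - 2 * k[i][j])

# Path graph 0 - 1 - 2: volume 4, resistance distances 1 and 2.
path = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]
c01 = commute_time(path, 0, 1)
c02 = commute_time(path, 0, 2)
```

On this path graph the commute times come out to 4 and 8, matching volume times resistance distance, which is consistent with the parameter-free, random-walk interpretation the paper highlights.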
Citations: 129
Cluster Analysis of Time-Series Medical Data Based on the Trajectory Representation and Multiscale Comparison Techniques
Pub Date : 2006-12-18 DOI: 10.1109/ICDM.2006.33
S. Hirano, S. Tsumoto
This paper presents a cluster analysis method for multidimensional time-series data on clinical laboratory examinations. Our method represents the time series of test results as trajectories in multidimensional space, and compares their structural similarity by using the multiscale comparison technique. It enables us to find the part-to-part correspondences between two trajectories, taking into account the relationships between different tests. The resultant dissimilarity can be further used with clustering algorithms for finding the groups of similar cases. The method was applied to the cluster analysis of Albumin-Platelet data in the chronic hepatitis dataset. The results demonstrated that it could form interesting groups of cases that have high correspondence to the fibrotic stages.
Citations: 35
Mining Maximal Generalized Frequent Geographic Patterns with Knowledge Constraints
Pub Date : 2006-12-18 DOI: 10.1109/ICDM.2006.110
V. Bogorny, J. Valiati, S. D. S. Camargo, P. Engel, B. Kuijpers, L. Alvares
In frequent geographic pattern mining, a large number of patterns are known a priori. This paper presents a novel approach for mining frequent geographic patterns without associations previously known to be non-interesting. Geographic dependences are eliminated during frequent-set generation using prior knowledge. After the dependence elimination, maximal generalized frequent sets are computed to remove redundant frequent sets. Experimental results show a significant reduction in both the number of frequent sets and the computational time for mining maximal frequent geographic patterns.
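A minimal sketch of the dependence-elimination idea: prune frequent itemsets that contain a pair known a priori to be geographically dependent (e.g. every gas station touches a street, so that co-occurrence is uninteresting). The exhaustive miner, pruning rule, helper names, and toy transactions are assumptions for illustration, not the paper's algorithm.

```python
from itertools import combinations
from collections import Counter

def frequent_itemsets(transactions, minsup):
    """Exhaustive frequent-itemset count (fine for tiny examples only)."""
    counts = Counter()
    for t in transactions:
        for r in range(1, len(t) + 1):
            for combo in combinations(sorted(t), r):
                counts[combo] += 1
    return {s for s, c in counts.items() if c >= minsup}

def prune_known(freq, known_pairs):
    """Drop itemsets containing a geographic dependence known a priori."""
    return {s for s in freq
            if not any(a in s and b in s for a, b in known_pairs)}

# Toy spatial transactions: features co-located in each region.
tx = [{"gas_station", "street", "hotel"},
      {"gas_station", "street"},
      {"hotel", "street"}]
freq = frequent_itemsets(tx, minsup=2)
pruned = prune_known(freq, [("gas_station", "street")])
```

A real miner would push the constraint into candidate generation rather than post-filtering, which is where the reported runtime savings come from.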
Citations: 26