2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM): Latest Papers

Clustering categorical data: A stability analysis framework
Pub Date : 2011-04-11 DOI: 10.1109/CIDM.2011.5949452
I. Jarman, T. Etchells, P. Lisboa, Charlene Beynon, J. Martín-Guerrero
Clustering to identify inherent structure is an important first step in data exploration. The k-means algorithm is a popular choice, but it is not generally appropriate for categorical data. A specific extension of k-means for categorical data is the k-modes algorithm. Both of these partition clustering methods are sensitive to the initialization of prototypes, which makes it difficult to select the best solution for a given problem. In addition, selecting the number of clusters can be an issue. Further, the k-modes method is especially prone to instability when presented with ‘noisy’ data, since the calculation of the mode lacks the smoothing effect inherent in the calculation of the mean. Such noise is common in real-world datasets, for instance in the domain of Public Health, resulting in solutions that can be radically different depending on the initialization and can therefore lead to different interpretations. This paper presents two methodologies. The first addresses sensitivity to initialization using a generic landscape mapping of k-modes solutions. The second uses the landscape map to stabilize the partition clusters for discrete data by drawing a consensus sample in order to separate signal from noise components. Results are presented for the benchmark soybean disease dataset, an artificially generated dataset, and a case study involving Public Health data.
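To make the initialization issue concrete, here is a minimal sketch of a plain k-modes loop (Hamming distance, attribute-wise modes) that reports the cost reached from a given set of starting prototypes. The function names, the toy records, and the tie-breaking behaviour are assumptions made for illustration; this is not the authors' implementation and it does not attempt the landscape mapping or consensus sampling described in the abstract.

```python
from collections import Counter


def hamming(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(rows):
    """Attribute-wise mode of a non-empty list of records (ties: first seen)."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(data, init_modes, max_iter=20):
    """Run k-modes from the given starting prototypes.

    Returns (labels, modes, cost), where cost is the summed Hamming
    distance of every record to the mode of its assigned cluster.
    """
    modes = list(init_modes)
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest mode by Hamming distance (ties: lowest index).
        new_labels = [
            min(range(len(modes)), key=lambda j: hamming(row, modes[j]))
            for row in data
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: recompute each mode from its current members.
        for j in range(len(modes)):
            members = [row for row, lab in zip(data, labels) if lab == j]
            if members:  # keep the old mode if a cluster becomes empty
                modes[j] = column_modes(members)
    cost = sum(hamming(row, modes[lab]) for row, lab in zip(data, labels))
    return labels, modes, cost


# Hypothetical toy records with two visible groups.
toy = [
    ("a", "x", "p"), ("a", "x", "q"), ("a", "y", "p"),
    ("b", "z", "r"), ("b", "z", "s"), ("b", "w", "r"),
]
for init in ([toy[0], toy[3]], [toy[0], toy[1]]):
    print(init, "-> cost", k_modes(toy, init)[2])
```

Running the loop from many random initializations and comparing the resulting costs and partitions is one simple way to see how much the solutions vary on a given dataset.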
Citations: 3
KB-CB-N classification: Towards unsupervised approach for supervised learning
Pub Date : 2011-04-11 DOI: 10.1109/CIDM.2011.5949435
Z. Abdallah, M. Gaber
Data classification has attracted considerable research attention in computational statistics and data mining due to its wide range of applications. K Best Cluster Based Neighbour (KB-CB-N) is our novel classification technique, which integrates three different similarity measures for cluster-based classification. The basic principle is to apply unsupervised learning to the instances of each class in the dataset and then use the output as input to the classification algorithm, which finds the K best neighbours of clusters from the density, gravity and distance perspectives. Clustering is applied as an initial step within each class to find the inherent in-class grouping in the dataset. Different data clustering techniques use different similarity measures, and each measure has its own strengths and weaknesses. Combining the three measures can therefore benefit from the strengths of each one and avoid the problems encountered when using an individual measure. Extensive experimental results on eight real datasets show that our new technique typically achieves improved or equivalent performance compared with existing state-of-the-art classification methods.
Citations: 12
Online autoregressive prediction in time series with delayed disclosure
Pub Date : 2011-04-11 DOI: 10.1109/CIDM.2011.5949440
J. Andreoli, Marie-Luise Schneider
We propose a supervised machine learning method to automate the classification of events within time series in a monitoring context. It is based on a generative stochastic model of the time series which combines a probabilistic autoregressive classifier to determine the class label of each event, and a hidden Markov model to capture the production of the events. Events can be described by arbitrary combinations of discrete and continuous features. While at training time (offline), it is assumed that the class labels of all the events are known, at inference time (online), when a prediction is to be made for an event, it is not assumed that the class labels of the preceding events are known. This makes prediction more complex due to the autoregressive nature of the model. Instead, we make and exploit a “delayed disclosure” assumption, namely that the class labels of all the events are eventually revealed, but the occurrence of an event and the revelation of its class are asynchronous. We report experimental results obtained by application of this approach to the monitoring of a fleet of distributed devices.
Citations: 0
Partially supervised k-harmonic means clustering
Pub Date : 2011-04-11 DOI: 10.1109/CIDM.2011.5949424
T. Runkler
A popular algorithm for finding clusters in unlabeled data optimizes the k-means clustering model. This algorithm converges quickly but is sensitive to initialization. Two ways to overcome this drawback are fuzzification and harmonic means. We show that k-harmonic means is a special case of reformulated fuzzy k-means. The main focus of this paper is on partially supervised clustering. Partially supervised clustering finds clusters in data sets that contain both unlabeled and labeled data. We review partially supervised k-means, partially supervised fuzzy k-means, and introduce a partially supervised extension of k-harmonic means. Experiments with four benchmark data sets indicate that partially supervised k-harmonic means inherits the advantages of its completely unsupervised variant: It is significantly less sensitive to initialization than partially supervised k-means.
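For readers unfamiliar with harmonic means in this setting, the sketch below compares the k-means objective (each point contributes its squared distance to the nearest center) with the commonly cited k-harmonic means objective, in which each point contributes the harmonic average of its distances to all centers. That this is the exact form used in the paper is an assumption, and the function names, the exponent default `p=2`, and the `eps` guard are illustrative choices; the partially supervised extension is not shown.

```python
import math


def kmeans_objective(points, centers):
    """Sum over points of the squared distance to the nearest center."""
    return sum(min(math.dist(x, c) ** 2 for c in centers) for x in points)


def khm_objective(points, centers, p=2, eps=1e-12):
    """Sum over points of k / sum_j (1 / ||x - c_j||^p).

    The harmonic average is dominated by the nearest center but still
    depends smoothly on every center, unlike the hard minimum above.
    eps guards against division by zero when a point sits on a center.
    """
    k = len(centers)
    total = 0.0
    for x in points:
        inverse_sum = sum(1.0 / max(math.dist(x, c) ** p, eps) for c in centers)
        total += k / inverse_sum
    return total


# Hypothetical toy data: two tight groups and one center placed in each.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
centers = [(0.0, 0.5), (4.0, 0.5)]
print(kmeans_objective(points, centers), khm_objective(points, centers))
```

With `p=2`, each per-point harmonic term lies between the nearest squared distance and k times that distance, so the two objectives stay within a factor of k of each other.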
Citations: 10
Increased classification accuracy and speedup through pair-wise feature selection for support vector machines
Pub Date : 2011-04-11 DOI: 10.1109/CIDM.2011.5949457
K. Kramer, Dmitry Goldgof, L. Hall, A. Remsen
Support vector machines are binary classifiers that can implement multi-class classification by creating a classifier for each possible pairwise combination of classes, or one for each class using a one-versus-all strategy. Feature selection algorithms often search for a single set of features to be used by all of the binary classifiers. This ignores the fact that features that are good discriminators for two particular classes might not do well for other class combinations. As a result, the feature selection process may leave these features out of the common set used by all support vector machines. It is shown that by selecting features for each binary class combination, overall classification accuracy can be improved (by as much as 2.1%), feature selection time can be significantly reduced (a speedup of 3.2 times), and the time required for training a multi-class support vector machine is reduced. Another benefit of this approach is that considerably less time is required for feature selection when additional classes are added to the training data. This is because the features selected for the existing class combinations are still valid, so feature selection only needs to be run for the newly created class combinations.
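The idea of choosing a separate feature subset for every class pair can be sketched as follows. The ranking criterion here, a simple per-feature separation score (squared difference of class means over the sum of class variances), is a stand-in chosen for brevity and is an assumption; the abstract does not state which selection algorithm the authors use, and no SVM is trained in this sketch. All names and the toy data are hypothetical.

```python
from itertools import combinations
from statistics import mean, pvariance


def separation_score(a, b):
    """Squared gap between class means divided by the summed class variances."""
    spread = pvariance(a) + pvariance(b)
    gap = (mean(a) - mean(b)) ** 2
    if spread == 0:
        return float("inf") if gap > 0 else 0.0
    return gap / spread


def pairwise_feature_sets(X, y, n_keep):
    """Return {(class_1, class_2): [feature indices]} for every class pair."""
    classes = sorted(set(y))
    n_features = len(X[0])
    selected = {}
    for c1, c2 in combinations(classes, 2):
        scores = []
        for f in range(n_features):
            a = [row[f] for row, lab in zip(X, y) if lab == c1]
            b = [row[f] for row, lab in zip(X, y) if lab == c2]
            scores.append((separation_score(a, b), f))
        # Best score first; ties broken by the lower feature index.
        scores.sort(key=lambda s: (-s[0], s[1]))
        selected[(c1, c2)] = [f for _, f in scores[:n_keep]]
    return selected


# Hypothetical toy data: feature 0 separates A from B, feature 1 separates B from C.
X = [(0.0, 5.0), (0.2, 5.1), (3.0, 5.0), (3.2, 5.2), (3.1, 9.0), (2.9, 9.3)]
y = ["A", "A", "B", "B", "C", "C"]
print(pairwise_feature_sets(X, y, n_keep=1))
```

Because each pair keeps its own subset, adding a new class only requires scoring the pairs that involve that class; the subsets already chosen for existing pairs are untouched, which is the incremental benefit the abstract points out.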
Citations: 7
Partial generalized correlation for hyperspectral data
Pub Date : 2011-04-11 DOI: 10.1109/CIDM.2011.5949422
M. Strickert, B. Labitzke, V. Blanz
A variational approach is proposed for the unsupervised assessment of attribute variability in high-dimensional data, given a differentiable similarity measure. The key question addressed is how much each data attribute contributes to an optimum transformation of vectors for reaching maximum similarity. This question is formalized and solved in a mathematically rigorous optimization framework for each data pair of interest. Trivially, for the Euclidean metric, minimization to zero distance induces the highest vector similarity, whereas for the linear Pearson correlation measure the highest similarity, a value of one, is desired. During optimization, the not necessarily symmetric trajectories between two vectors are recorded and analyzed in terms of attribute changes and line integral. The proposed formalism makes it possible to assess partial covariance and correlation characteristics of data attributes for vectors compared by any differentiable similarity measure. Its potential for generating alternative and localized views, such as for contrast enhancement, is demonstrated on hyperspectral images from the remote sensing domain.
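The contrast the abstract draws between the two optimization targets (distance zero for the Euclidean metric, correlation one for Pearson) can be checked with a small sketch. The helper names and the sample vectors are assumptions for illustration only, and nothing here reproduces the variational procedure itself.

```python
import math


def euclidean(x, y):
    """Euclidean distance; the most similar vectors have distance 0."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson(x, y):
    """Linear Pearson correlation; the most similar vectors have value 1.

    Assumes equal-length vectors, neither of which is constant.
    """
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical vectors: y is a positive linear function of x (y = 2x + 1).
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
print(euclidean(x, y), pearson(x, y))
```

The two vectors are far apart in Euclidean terms yet maximally similar under Pearson correlation, so the set of transformations that reach "maximum similarity" depends on the measure chosen.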
Citations: 1
Periodic quick test for classifying long-term activities
Pub Date : 2011-04-11 DOI: 10.1109/CIDM.2011.5949426
Pekka Siirtola, Heli Koskimäki, J. Röning
A novel method to classify long-term human activities is presented in this study. The method consists of two parts: a quick test and periodic classification. The quick test uses temporal information to improve recognition accuracy, while the periodic classification is based on the assumption that the recognized activities are long-term. Periodic quick test (PQT) classification was tested using a data set consisting of six long-term sports exercises. The data were collected from six persons wearing a two-dimensional accelerometer on their wrist. The results were compared with a normal sliding window technique, which divides the signal into smaller sequences and classifies each sequence into one of the six classes, without using temporal information or assuming that activities are long-term. The presented method is not only faster than this normal method but also more accurate: the classification accuracy of the normal method was around 84%, while the recognition rate of PQT was over 90%. In addition, the number of sequences classified by the normal method was over six times higher than with PQT.
Citations: 11
FGMAC: Frequent subgraph mining with Arc Consistency
Pub Date : 2011-04-11 DOI: 10.1109/CIDM.2011.5949436
Brahim Douar, M. Liquiere, C. Latiri, Y. Slimani
With the growing need to analyze large amounts of structured data, such as chemical compounds, protein structures, and XML documents, to cite but a few, graph mining has become an attractive track and a real challenge in the data mining field. Among the various kinds of graph patterns, frequent subgraphs seem to be relevant for characterizing graph sets, discriminating between different groups of sets, and classifying and clustering graphs. Because of the NP-completeness of the subgraph isomorphism test as well as the huge search space, fragment miners are exponential in runtime and/or memory consumption. In this paper we study a new polynomial projection operator, named AC-projection, based on a key technique of constraint programming, namely Arc Consistency (AC). It is intended to replace the exponential subgraph isomorphism test. We study the relevance of frequent AC-reduced graph patterns for classification, and we prove that an important performance gain can be achieved with no loss, or only a non-significant loss, in the quality of the discovered patterns.
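As a rough illustration of how arc consistency can stand in for an exact matching test, the sketch below filters candidate target vertices for each pattern vertex of a labeled undirected graph. It is a generic arc-consistency filter written for this note; the function name, the graph encoding (adjacency sets plus label dictionaries), and the example graphs are assumptions, and it should not be read as the paper's AC-projection operator.

```python
def ac_filter(pattern_adj, pattern_label, target_adj, target_label):
    """Return, for each pattern vertex, the target vertices that survive filtering.

    A candidate v for pattern vertex u is removed when some pattern
    neighbour w of u has no remaining candidate among the target
    neighbours of v. Filtering repeats until nothing changes, which
    takes polynomial time because candidates are only ever removed.
    """
    # Start from label-compatible candidates.
    domains = {
        u: {v for v in target_adj if target_label[v] == pattern_label[u]}
        for u in pattern_adj
    }
    changed = True
    while changed:
        changed = False
        for u, neighbours in pattern_adj.items():
            for v in list(domains[u]):
                if any(not (target_adj[v] & domains[w]) for w in neighbours):
                    domains[u].discard(v)
                    changed = True
    return domains


# Hypothetical pattern: a single C-O edge. Hypothetical target: a C-C-O chain.
pattern_adj = {"u0": {"u1"}, "u1": {"u0"}}
pattern_label = {"u0": "C", "u1": "O"}
target_adj = {"t0": {"t1"}, "t1": {"t0", "t2"}, "t2": {"t1"}}
target_label = {"t0": "C", "t1": "C", "t2": "O"}
print(ac_filter(pattern_adj, pattern_label, target_adj, target_label))
```

If any domain ends up empty, the pattern cannot occur in the target. Non-empty domains are a necessary condition only: a triangle pattern passes this filter against a six-vertex cycle with identical labels even though the cycle contains no triangle. This is the kind of relaxation that trades exactness for polynomial running time.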
Citations: 5
Multiple query-dependent RankSVM aggregation for document retrieval
Pub Date : 2011-04-11 DOI: 10.1109/CIDM.2011.5949420
Yang Wang, Min Lu, X. Pang, Maoqiang Xie, Yalou Huang
This paper is concerned with supervised rank aggregation, which aims to improve ranking performance by combining the outputs of multiple rankers. Previous rank aggregation approaches have two main shortcomings. First, the learned weights for base rankers do not distinguish between queries. This is suboptimal, since queries vary significantly in terms of ranking. Second, most current aggregation functions are unsupervised, whereas a supervised aggregation function could further improve ranking performance. In this paper, the significant differences among queries are taken into consideration, and a supervised rank aggregation approach is proposed. As a case study, we employ the RankSVM model to aggregate the base rankers, an approach referred to as Q.D.RSVM, and prove that Q.D.RSVM can set up query-dependent weights for different base rankers. Experimental results on benchmark datasets show that our approach outperforms conventional ranking approaches.
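The effect of query-dependent weights can be illustrated with a short sketch in which the same base-ranker scores produce different final orderings once the weights change per query. The weights below are hand-picked, hypothetical values; in the paper they would be learned with RankSVM, which is not reproduced here, and the function name and data layout are assumptions.

```python
def aggregate(scores_by_ranker, weights):
    """Order documents by a weighted sum of base-ranker scores.

    scores_by_ranker: list of dicts mapping document id -> score,
                      one dict per base ranker, all with the same keys.
    weights:          one weight per base ranker for the current query.
    Ties are broken by document id so the result is deterministic.
    """
    docs = list(scores_by_ranker[0])
    combined = {
        d: sum(w * scores[d] for w, scores in zip(weights, scores_by_ranker))
        for d in docs
    }
    return sorted(docs, key=lambda d: (-combined[d], d))


# Hypothetical scores from two base rankers for the same three documents.
ranker_1 = {"d1": 0.9, "d2": 0.1, "d3": 0.5}
ranker_2 = {"d1": 0.2, "d2": 0.8, "d3": 0.5}

# Hypothetical per-query weights: query A trusts ranker 1, query B trusts ranker 2.
print(aggregate([ranker_1, ranker_2], weights=[0.9, 0.1]))
print(aggregate([ranker_1, ranker_2], weights=[0.1, 0.9]))
```

A single global weight vector would force both queries to share one ordering, which is the limitation of earlier aggregation approaches that the abstract describes.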
Citations: 0
A GPU-based interactive bio-inspired visual clustering
Pub Date : 2011-04-11 DOI: 10.1109/CIDM.2011.5949300
U. Erra, Bernardino Frola, V. Scarano
In this work, we present an interactive visual clustering approach for exploring and analyzing large volumes of data. The approach is a bio-inspired collective behavioral model that runs in a 3D graphics environment. The paper describes an extension of the behavioral model for clustering and a parallel implementation that uses the Compute Unified Device Architecture (CUDA) to exploit the computational power of graphics processing units (GPUs). The advantage of the approach is that the user is directly involved in the data mining process as data enters the environment. Experiments on a number of real and synthetic data sets illustrate the effectiveness and efficiency of the approach.
Citations: 2