Deep Structure Learning: Beyond Connectionist Approaches
B. Mitchell, John W. Sheppard. ICMLA 2012. DOI: 10.1109/ICMLA.2012.34
Deep structure learning is a promising new area of machine learning research. Previous work in this area has shown impressive performance, but all of it has used connectionist models. We aim to demonstrate that the utility of deep architectures is not restricted to connectionist models. Our approach is to use simple, non-connectionist dimensionality reduction techniques in conjunction with a deep architecture, so as to examine more precisely the impact of the deep architecture itself. To do this, we use standard PCA as a baseline and compare it with a deep architecture built from PCA. We perform several image classification experiments using the features generated by the two techniques and conclude that the deep architecture leads to improved classification performance, supporting the deep structure hypothesis.

Age-Group Classification of Facial Images
Li Liu, Jianming Liu, Jun Cheng. ICMLA 2012. DOI: 10.1109/ICMLA.2012.129
This paper presents age-group classification based on facial images. We perform age-group classification by dividing ages into five groups according to the progression of aging. Features are extracted from face images with an Active Appearance Model (AAM), which describes the shape and gray-level variation of the face. Principal Component Analysis (PCA) is adopted to reduce the dimensionality, and a Support Vector Machine (SVM) classifier with a Gaussian Radial Basis Function (RBF) kernel is trained. Experimental results demonstrate that AAM can improve the performance of age estimation.

Increasing Efficiency of Evolutionary Algorithms by Choosing between Auxiliary Fitness Functions with Reinforcement Learning
Arina Buzdalova, M. Buzdalov. ICMLA 2012. DOI: 10.1109/ICMLA.2012.32
This paper further investigates a previously proposed method for speeding up single-objective evolutionary algorithms. The method uses reinforcement learning to choose among auxiliary fitness functions. We formulate the requirements for this method and illustrate that it meets them on model problems such as the Royal Roads problem and the H-IFF optimization problem. The experiments confirm that the method increases the efficiency of evolutionary algorithms.

Block Level Video Steganalysis Scheme
K. Kancherla, Srinivas Mukkamala. ICMLA 2012. DOI: 10.1109/ICMLA.2012.121
In this paper, we propose a block-level video steganalysis method; current steganalysis methods detect steganograms at the frame level only. The new method uses the correlation of pattern noise between consecutive frames as its feature. First, we extract the pattern noise from each frame and compute the difference between the pattern noise of consecutive frames. We then divide the difference matrix into blocks and apply the Discrete Cosine Transform (DCT), using the 63 lowest-frequency DCT coefficients as the feature vector for each block. We used ten different videos in our experiments. Our results show the potential of the method for detecting video steganograms at the block level.

A Comparative Study on the Stability of Software Metric Selection Techniques
Huanjing Wang, T. Khoshgoftaar, Randall Wald, Amri Napolitano. ICMLA 2012. DOI: 10.1109/ICMLA.2012.142
In large software projects, software quality prediction is an important aspect of the development cycle, helping to focus quality assurance efforts on the modules most likely to contain faults. To perform software quality prediction, various software metrics are collected during the software development cycle, and models are built using these metrics. However, not all features (metrics) make the same contribution to the class attribute (e.g., faulty/not faulty). Thus, selecting a subset of metrics that are relevant to the class attribute is a critical step. As many feature selection algorithms exist, it is important to find ones that produce consistent results even as the underlying data is changed; this quality of producing consistent results is referred to as "stability." In this paper, we investigate the stability of seven feature selection techniques in the context of software quality classification. We compare four approaches for varying the underlying data to evaluate stability: the traditional approach of generating many subsamples of the original data and comparing the features selected from each; an earlier approach developed by our research group, which compares the features selected from subsamples of the data with those selected from the original; and two newly proposed approaches based on comparing pairs of subsamples specifically designed to have the same number of instances and a specified level of overlap, one comparing within each pair and the other comparing the generated subsamples with the original dataset. The empirical validation is carried out on sixteen software metrics datasets. Our results show that ReliefF is the most stable feature selection technique. The results also show that the level of overlap, degree of perturbation, and feature subset size do affect the stability of feature selection methods. Finally, we find that all four approaches to evaluating stability produce similar results in terms of which feature selection techniques are best under different circumstances.

Fast Insight into High-Dimensional Parametrized Simulation Data
D. Butnaru, B. Peherstorfer, H. Bungartz, D. Pflüger. ICMLA 2012. DOI: 10.1109/ICMLA.2012.189
Numerical simulation has become an indispensable tool in most industrial product development processes, with simulations being used to understand the influence of design decisions (parameter configurations) on the structure and properties of the product. However, in order to allow the engineer to thoroughly explore the design space and fine-tune parameters, many simulation runs, which are usually very time-consuming, are necessary. This also produces a huge amount of data that cannot be analyzed efficiently without the support of appropriate tools. In this paper, we address a two-fold problem: first, instantly provide simulation results when the parameter configuration is changed, and, second, identify specific areas of the design space with concentrated change and thus importance. We propose a hierarchical approach based on sparse grid interpolation or regression, which acts as an efficient and cheap substitute for the simulation. Furthermore, we develop new visual representations based on the derivative information contained inherently in the hierarchical basis. They intuitively let a user identify interesting parameter regions even in higher-dimensional settings. This workflow is combined in an interactive visualization and exploration framework. We discuss examples from different fields of computational science and engineering and show how our sparse-grid-based techniques make parameter dependencies apparent and how they can be used to fine-tune parameter configurations.

Binary Function Clustering Using Semantic Hashes
Wesley Jin, S. Chaki, Cory F. Cohen, A. Gurfinkel, Jeffrey Havrilla, C. Hines, P. Narasimhan. ICMLA 2012. DOI: 10.1109/ICMLA.2012.70
The ability to identify semantically related functions in large collections of binary executables is important for malware detection. Intuitively, two pieces of code are similar if they have the same effect on a machine's state. Current state-of-the-art tools employ a variety of pairwise comparisons (e.g., template matching using SMT solvers, value-set analysis at critical program points, API call matching, etc.). However, these methods do not scale to clustering large datasets of size N, since they require O(N^2) comparisons. In this paper, we present an alternative approach based upon hashing. We propose a scheme that captures the semantics of functions as semantic hashes. Our approach treats a function as a set of features, each of which represents the input-output behavior of a basic block. Using a form of locality-sensitive hashing known as MinHashing, functions with many common features can be quickly identified, and the complexity of clustering is reduced to O(N). Experiments on functions extracted from the CERT malware catalog indicate that we are able to cluster closely related code with a low false positive rate.

Automatically Detecting Avalanche Events in Passive Seismic Data
Marc J. Rubin, T. Camp, A. Herwijnen, J. Schweizer. ICMLA 2012. DOI: 10.1109/ICMLA.2012.12
During the 2010-2011 winter season, we deployed seven geophones on a mountain outside of Davos, Switzerland, and collected over 100 days of seismic data containing 385 possible avalanche events (33 confirmed slab avalanches). In this article, we describe our efforts to develop a pattern recognition workflow to automatically detect snow avalanche events in passive seismic data. Our initial workflow consisted of frequency-domain feature extraction, cluster-based stratified subsampling, and 100 runs of training and testing of 12 different classification algorithms. When tested on the entire season of data from a single sensor, all twelve machine learning algorithms achieved mean classification accuracies above 84%, with seven classifiers reaching over 90%. We then experimented with a voting-based paradigm that combined information from all seven sensors. This method increased overall accuracy and precision, but performed quite poorly in terms of classifier recall. We therefore decided to pursue other signal preprocessing methodologies. We focused our efforts on improving the overall performance of single-sensor avalanche detection and employed spectral-flux-based event selection to identify events with significant instantaneous increases in spectral energy. With a threshold of 90% relative spectral flux increase, we correctly selected 32 of 33 slab avalanches and reduced our problem space by nearly 98%. When trained and tested on this reduced data set of only significant events, a decision stump classifier achieved 93% overall accuracy and 89.5% recall, and improved the precision of our initial workflow from 2.8% to 13.2%.

A Hybrid Method for Estimating the Predominant Number of Clusters in a Data Set
Jamil Alshaqsi, Wenjia Wang. ICMLA 2012. DOI: 10.1109/ICMLA.2012.146
In cluster analysis, determining the number of clusters, K, for a given dataset is an important yet very tricky task, simply because there is often no universally accepted correct answer for non-trivial real-world problems, and the answer also depends on the context and purpose of the cluster study. This paper presents a new hybrid method for estimating the predominant number of clusters automatically. It employs a new similarity measure, calculates the length of constant similarity intervals, L, and takes the longest consistent intervals as representing the most probable numbers of clusters under the given context. An error function is defined to measure and evaluate the goodness of the estimates. The proposed method has been tested on 3 synthetic datasets and 8 real-world benchmark datasets and compared with other popular methods. The experimental results show that the proposed method determines the desired number of clusters for all the simulated datasets and most of the benchmark datasets, and the statistical tests indicate that our method is significantly better.

A Machine Learning Based Topic Exploration and Categorization on Surveys
Clint P. George, D. Wang, Joseph N. Wilson, L. Epstein, Philip Garland, Annabell Suh. ICMLA 2012. DOI: 10.1109/ICMLA.2012.132
This paper describes an automatic topic extraction, categorization, and relevance ranking model for multilingual surveys and questions that exploits machine learning algorithms such as topic modeling and fuzzy clustering. Automatically generated question and survey categories are used to build question banks and category-specific survey templates. First, we describe the pre-processing steps we considered for removing noise from the multilingual survey text. Second, we explain our strategy for automatically extracting survey categories from surveys based on topic models. Third, we describe different methods for clustering questions under survey categories and grouping them by relevance. Last, we describe our experimental results on a large set of unique, real-world survey datasets in German, Spanish, French, and Portuguese, along with our refinement methods for deriving meaningful and sensible categories for building question banks. We conclude with possible enhancements to the current system and its impact in the business domain.