Deep Structure Learning: Beyond Connectionist Approaches
B. Mitchell, John W. Sheppard. ICMLA 2012. DOI: 10.1109/ICMLA.2012.34
Deep structure learning is a promising new area of machine learning research. Previous work in this area has shown impressive performance, but all of it has used connectionist models. We aim to demonstrate that the utility of deep architectures is not restricted to connectionist models. Our approach is to use simple, non-connectionist dimensionality reduction techniques in conjunction with a deep architecture, so as to examine more precisely the impact of the deep architecture itself. To do this, we use standard PCA as a baseline and compare it with a deep architecture built from PCA. We perform several image classification experiments using the features generated by the two techniques and conclude that the deep architecture leads to improved classification performance, supporting the deep structure hypothesis.

Age-Group Classification of Facial Images
Li Liu, Jianming Liu, Jun Cheng. ICMLA 2012. DOI: 10.1109/ICMLA.2012.129
This paper presents age-group classification based on facial images. We perform age-group classification by dividing ages into five groups according to the progression of aging. Features are extracted from face images with an Active Appearance Model (AAM), which describes the shape and gray-level variation of the face. Principal Component Analysis (PCA) is adopted to reduce the dimensionality, and a Support Vector Machine (SVM) classifier with a Gaussian Radial Basis Function (RBF) kernel is trained. Experimental results demonstrate that AAM can improve the performance of age estimation.

Increasing Efficiency of Evolutionary Algorithms by Choosing between Auxiliary Fitness Functions with Reinforcement Learning
Arina Buzdalova, M. Buzdalov. ICMLA 2012. DOI: 10.1109/ICMLA.2012.32
This paper further investigates a previously proposed method for speeding up single-objective evolutionary algorithms. The method uses reinforcement learning to choose among auxiliary fitness functions. We formulate the requirements for this method and illustrate that it meets them on model problems such as the Royal Roads problem and the H-IFF optimization problem. The experiments confirm that the method increases the efficiency of evolutionary algorithms.

Block Level Video Steganalysis Scheme
K. Kancherla, Srinivas Mukkamala. ICMLA 2012. DOI: 10.1109/ICMLA.2012.121
In this paper, we propose a block-level video steganalysis method; current steganalysis methods detect steganograms at the frame level only. The new method uses the correlation of pattern noise between consecutive frames as its feature. First, we extract the pattern noise from each frame and compute the difference between the pattern noise of consecutive frames. We then divide the difference matrix into blocks and apply the Discrete Cosine Transform (DCT), using the 63 lowest-frequency DCT coefficients as the feature vector for each block. We used ten different videos in our experiments. Our results show the potential of the method for detecting video steganograms at the block level.

A Comparative Study on the Stability of Software Metric Selection Techniques
Huanjing Wang, T. Khoshgoftaar, Randall Wald, Amri Napolitano. ICMLA 2012. DOI: 10.1109/ICMLA.2012.142
In large software projects, software quality prediction is an important aspect of the development cycle, helping to focus quality assurance efforts on the modules most likely to contain faults. To perform software quality prediction, various software metrics are collected during the software development cycle, and models are built using these metrics. However, not all features (metrics) make the same contribution to the class attribute (e.g., faulty/not faulty). Thus, selecting a subset of metrics that are relevant to the class attribute is a critical step. As many feature selection algorithms exist, it is important to find ones that produce consistent results even as the underlying data is changed; this quality of producing consistent results is referred to as "stability." In this paper, we investigate the stability of seven feature selection techniques in the context of software quality classification. We compare four approaches for varying the underlying data to evaluate stability: the traditional approach of generating many subsamples of the original data and comparing the features selected from each; an earlier approach developed by our research group, which compares the features selected from subsamples of the data with those selected from the original; and two newly proposed approaches based on comparing pairs of subsamples specifically designed to have the same number of instances and a specified level of overlap, one comparing within each pair and the other comparing the generated subsamples with the original dataset. The empirical validation is carried out on sixteen software metrics datasets. Our results show that ReliefF is the most stable feature selection technique. The results also show that the level of overlap, degree of perturbation, and feature subset size do affect the stability of feature selection methods. Finally, we find that all four approaches to evaluating stability produce similar results in terms of which feature selection techniques are best under different circumstances.

Fast Insight into High-Dimensional Parametrized Simulation Data
D. Butnaru, B. Peherstorfer, H. Bungartz, D. Pflüger. ICMLA 2012. DOI: 10.1109/ICMLA.2012.189
Numerical simulation has become an indispensable tool in most industrial product development processes, with simulations being used to understand the influence of design decisions (parameter configurations) on the structure and properties of the product. However, in order to allow the engineer to thoroughly explore the design space and fine-tune parameters, many simulation runs, which are usually very time-consuming, are necessary. This also produces a huge amount of data that cannot be analyzed efficiently without the support of appropriate tools. In this paper, we address a two-fold problem: first, instantly provide simulation results when the parameter configuration is changed, and, second, identify specific areas of the design space with concentrated change and thus importance. We propose a hierarchical approach based on sparse grid interpolation or regression, which acts as an efficient and cheap substitute for the simulation. Furthermore, we develop new visual representations based on the derivative information contained inherently in the hierarchical basis. They intuitively let a user identify interesting parameter regions even in higher-dimensional settings. This workflow is combined in an interactive visualization and exploration framework. We discuss examples from different fields of computational science and engineering and show how our sparse-grid-based techniques make parameter dependencies apparent and how they can be used to fine-tune parameter configurations.

Binary Function Clustering Using Semantic Hashes
Wesley Jin, S. Chaki, Cory F. Cohen, A. Gurfinkel, Jeffrey Havrilla, C. Hines, P. Narasimhan. ICMLA 2012. DOI: 10.1109/ICMLA.2012.70
The ability to identify semantically related functions in large collections of binary executables is important for malware detection. Intuitively, two pieces of code are similar if they have the same effect on a machine's state. Current state-of-the-art tools employ a variety of pairwise comparisons (e.g., template matching using SMT solvers, value-set analysis at critical program points, API call matching, etc.). However, these methods do not scale to clustering large datasets of size N, since they require O(N^2) comparisons. In this paper, we present an alternative approach based upon hashing. We propose a scheme that captures the semantics of functions as semantic hashes. Our approach treats a function as a set of features, each of which represents the input-output behavior of a basic block. Using a form of locality-sensitive hashing known as MinHashing, functions with many common features can be quickly identified, and the complexity of clustering is reduced to O(N). Experiments on functions extracted from the CERT malware catalog indicate that we are able to cluster closely related code with a low false positive rate.

Automatically Detecting Avalanche Events in Passive Seismic Data
Marc J. Rubin, T. Camp, A. Herwijnen, J. Schweizer. ICMLA 2012. DOI: 10.1109/ICMLA.2012.12
During the 2010-2011 winter season, we deployed seven geophones on a mountain outside of Davos, Switzerland, and collected over 100 days of seismic data containing 385 possible avalanche events (33 confirmed slab avalanches). In this article, we describe our efforts to develop a pattern recognition workflow to automatically detect snow avalanche events in passive seismic data. Our initial workflow consisted of frequency-domain feature extraction, cluster-based stratified subsampling, and 100 runs of training and testing of 12 different classification algorithms. When tested on the entire season of data from a single sensor, all twelve machine learning algorithms achieved mean classification accuracies above 84%, with seven classifiers reaching over 90%. We then experimented with a voting-based paradigm that combined information from all seven sensors. This method increased overall accuracy and precision, but performed quite poorly in terms of classifier recall. We therefore decided to pursue other signal preprocessing methodologies. We focused our efforts on improving the overall performance of single-sensor avalanche detection and employed spectral-flux-based event selection to identify events with significant instantaneous increases in spectral energy. With a threshold of 90% relative spectral flux increase, we correctly selected 32 of 33 slab avalanches and reduced our problem space by nearly 98%. When trained and tested on this reduced data set of only significant events, a decision stump classifier achieved 93% overall accuracy and 89.5% recall, and improved the precision of our initial workflow from 2.8% to 13.2%.

A Hybrid Method for Estimating the Predominant Number of Clusters in a Data Set
Jamil Alshaqsi, Wenjia Wang. ICMLA 2012. DOI: 10.1109/ICMLA.2012.146
In cluster analysis, determining the number of clusters, K, for a given dataset is an important yet very tricky task, simply because there is often no universally accepted correct answer for non-trivial real-world problems, and the answer also depends on the context and purpose of the cluster study. This paper presents a new hybrid method for estimating the predominant number of clusters automatically. It employs a new similarity measure, calculates the length of constant similarity intervals, L, and takes the longest consistent intervals as representing the most probable numbers of clusters under the given context. An error function is defined to measure and evaluate the goodness of the estimates. The proposed method has been tested on 3 synthetic datasets and 8 real-world benchmark datasets and compared with other popular methods. The experimental results show that the proposed method determines the desired number of clusters for all the simulated datasets and most of the benchmark datasets, and the statistical tests indicate that our method is significantly better.

A Machine Learning Based Topic Exploration and Categorization on Surveys
Clint P. George, D. Wang, Joseph N. Wilson, L. Epstein, Philip Garland, Annabell Suh. ICMLA 2012. DOI: 10.1109/ICMLA.2012.132
This paper describes an automatic topic extraction, categorization, and relevance ranking model for multilingual surveys and questions that exploits machine learning algorithms such as topic modeling and fuzzy clustering. Automatically generated question and survey categories are used to build question banks and category-specific survey templates. First, we describe the pre-processing steps we considered for removing noise from the multilingual survey text. Second, we explain our strategy for automatically extracting survey categories from surveys based on topic models. Third, we describe different methods for clustering questions under survey categories and grouping them by relevance. Last, we describe our experimental results on a large set of unique, real-world survey datasets in German, Spanish, French, and Portuguese, along with our refinement methods for deriving meaningful and sensible categories for building question banks. We conclude with possible enhancements to the current system and its impact in the business domain.