A. Notsu, Katsuhiro Honda, H. Ichihashi, Yuki Komori
In this paper, we propose Simple Reinforcement Learning for a reinforcement learning agent with small memory. In the real world, learning is difficult because the numbers of states and actions are effectively infinite, demanding large amounts of stored memory and long learning times. To address this problem, estimated values are categorized as "GOOD" or "NO GOOD" during the reinforcement learning process. Additionally, the alignment sequence of estimated values is changed, because the sequence itself is regarded as important. We conducted several simulations and observed the influence of our methods. The simulation results show no adverse effect on learning speed.
{"title":"Simple Reinforcement Learning for Small-Memory Agent","authors":"A. Notsu, Katsuhiro Honda, H. Ichihashi, Yuki Komori","doi":"10.1109/ICMLA.2011.127","DOIUrl":"https://doi.org/10.1109/ICMLA.2011.127","url":null,"abstract":"In this paper, we propose Simple Reinforcement Learning for a reinforcement learning agent that has small memory. In the real world, learning is difficult because there are an infinite number of states and actions that need a large number of stored memories and learning times. To solve a problem, estimated values are categorized as ``GOOD\" or ``NO GOOD\" in the reinforcement learning process. Additionally, the alignment sequence of estimated values is changed because they are regarded as an important sequence themselves. We conducted some simulations and observed the influence of our methods. Several simulation results show no bad influence on learning speed.","PeriodicalId":439926,"journal":{"name":"2011 10th International Conference on Machine Learning and Applications and Workshops","volume":"126 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124453787","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
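The GOOD/NO GOOD idea above can be sketched as an agent that stores one bit per state-action pair instead of a real-valued estimate. The class name, the reward-sign update rule, and the bandit-style interface below are illustrative assumptions for a minimal sketch, not the paper's actual algorithm:

```python
import random

class TinyBinaryQAgent:
    """Small-memory sketch: each (state, action) pair keeps a one-bit
    label, GOOD (True) or NO GOOD (False), instead of a float estimate.
    Hypothetical simplification of the GOOD/NO GOOD categorization."""

    def __init__(self, n_states, n_actions, epsilon=0.1, seed=0):
        self.labels = [[False] * n_actions for _ in range(n_states)]
        self.epsilon = epsilon
        self.n_actions = n_actions
        self.rng = random.Random(seed)

    def act(self, state):
        good = [a for a in range(self.n_actions) if self.labels[state][a]]
        if good and self.rng.random() > self.epsilon:
            return self.rng.choice(good)           # exploit a GOOD action
        return self.rng.randrange(self.n_actions)  # explore uniformly

    def update(self, state, action, reward):
        # Coarse one-bit update: positive reward marks the action GOOD,
        # negative reward demotes it to NO GOOD.
        if reward > 0:
            self.labels[state][action] = True
        elif reward < 0:
            self.labels[state][action] = False
```

A one-state bandit suffices to see the memory saving: the whole value table is two booleans rather than two floats.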
Lana Yeganova, Won Kim, Donald C. Comeau, W. Wilbur
In this paper, we describe and compare two methods for automatically learning meaningful biomedical categories in Medline®. The first approach is a simple statistical method that uses part-of-speech and frequency information to extract a list of frequent headwords from noun phrases in Medline. The second method implements an alignment-based technique to learn frequent generic patterns that indicate a hyponymy/hypernymy relationship between a pair of noun phrases. We then apply these patterns to Medline to collect frequent hypernyms, which are potential biomedical categories. We study and compare these two alternative sets of terms for identifying semantic categories in Medline. Our method is completely data-driven.
{"title":"Comparison of Two Methods for Finding Biomedical Categories in Medline","authors":"Lana Yeganova, Won Kim, Donald C. Comeau, W. Wilbur","doi":"10.1109/ICMLA.2011.50","DOIUrl":"https://doi.org/10.1109/ICMLA.2011.50","url":null,"abstract":"In this paper we describe and compare two methods for automatically learning meaningful biomedical categories in Medline®. The first approach is a simple statistical method that uses part-of-speech and frequency information to extract a list of frequent headwords from noun phrases in Medline. The second method implements an alignment-based technique to learn frequent generic patterns that indicate a hyponymy/hypernymy relationship between a pair of noun phrases. We then apply these patterns to Medline to collect frequent hypernyms, potential biomedical categories. We study and compare these two alternative sets of terms to identify semantic categories in Medline. Our method is completely data-driven.","PeriodicalId":439926,"journal":{"name":"2011 10th International Conference on Machine Learning and Applications and Workshops","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126311965","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
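The generic hypernymy patterns the second method learns are in the spirit of classic lexico-syntactic cues such as "X such as Y". A toy matcher for that single hard-coded cue (the function name and pattern are illustrative, not the paper's learned, alignment-derived patterns) might look like:

```python
import re

def hearst_pairs(text):
    """Collect (hypernym, hyponym) pairs from the single generic cue
    'X such as Y'. A toy stand-in for learned hypernymy patterns."""
    pairs = []
    for m in re.finditer(r"(\w+)\s+such as\s+(\w+)", text.lower()):
        pairs.append((m.group(1), m.group(2)))
    return pairs
```

Applied at Medline scale, the left-hand terms of frequent matches would accumulate as candidate category names.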
Automated paraphrasing of natural language text has many interesting applications, from aiding in better translations to generating text in a better and more appropriate style. In this paper, we are concerned with the problem of picking the best English sentence out of a set of machine-generated paraphrase sentences, each designed to express the same content as a human-generated original. We present a system for scoring sentences based on examples in large corpora. Specifically, we use the Microsoft Web N-Gram service and the text of the Brown Corpus to extract features from all candidate sentences and compare them against each other. We consider three feature-combination methods: a handcrafted decision tree, linear regression, and linear power-set regression. We find that while each method has particular strengths, linear power-set regression performs best against our human-evaluated test data.
{"title":"Combining Corpus-Based Features for Selecting Best Natural Language Sentences","authors":"Foaad Khosmood, R. Levinson","doi":"10.1109/ICMLA.2011.170","DOIUrl":"https://doi.org/10.1109/ICMLA.2011.170","url":null,"abstract":"Automated paraphrasing of natural language text has many interesting applications from aiding in better translations to generating better and more appropriate style language. In this paper, we are concerned with the problem of picking the best English sentence out of a set of machine generated paraphrase sentences, each designed to express the same content as a human generated original. We present a system of scoring sentences based on examples in large corpora. Specifically, we use the Microsoft Web N-Gram service and the text of the Brown Corpus to extract features from all candidate sentences and compare them against each other. We consider three feature combination methods: A handcrafted decision tree, linear regression and linear powerset regression. We find that while each method has particular strengths, the linear power set regression performs best against our human-evaluated test data.","PeriodicalId":439926,"journal":{"name":"2011 10th International Conference on Machine Learning and Applications and Workshops","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122156914","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
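The corpus-based scoring step can be sketched with a local add-one-smoothed bigram model standing in for the Microsoft Web N-Gram service and the Brown Corpus; the tokenization and smoothing choices here are assumptions for illustration, not the paper's feature set:

```python
import math
from collections import Counter

def bigram_scores(corpus_tokens, candidates):
    """Score each candidate sentence by its average bigram
    log-probability under an add-one-smoothed model built from a
    token list. The best paraphrase is the argmax of the scores."""
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    vocab = len(unigrams) + 1  # +1 for unseen words

    def logp(prev, word):
        # Add-one (Laplace) smoothed conditional probability.
        return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab))

    scores = []
    for sent in candidates:
        toks = sent.lower().split()
        pairs = list(zip(toks, toks[1:]))
        scores.append(sum(logp(a, b) for a, b in pairs) / max(len(pairs), 1))
    return scores
```

In the paper's setting, such a score would be one feature among several fed to the decision tree or regression combiner.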
This paper addresses the energy-efficiency issue for unmanned aerial vehicles (UAVs). The power requirement of a UAV system was modeled by accounting for the energy required by all of its sub-systems. In this model, a single UAV system was broken down into six power-consumption components, corresponding to design stages supported by established research areas and emerging technologies: (1) control, (2) data processing, (3) communication, (4) payloads, including sensors with actuators, (5) external loads acting as system perturbations, and (6) system dynamics with performance criteria.
{"title":"Energy Efficiency for Unmanned Aerial Vehicles","authors":"Balemir Uragun","doi":"10.1109/ICMLA.2011.159","DOIUrl":"https://doi.org/10.1109/ICMLA.2011.159","url":null,"abstract":"This paper emphasizes the energy efficiency issue for unmanned aerial vehicles (UAVs). The power requirement for an UAV system was modeled with the aid of energy requiring from all possible sub-systems. In this model, a single UAV system was broken down by the six power consumption components. The scientific research areas and emerging technologies assisted UAV design stages involved in the mainly \"six-part\" load components; those are (1) Control, (2) Data processing, (3) Communication, (4) Payloads, including sensors with actuators, (5) External Loads as system perturbation, and (6) System Dynamicity with a performance criteria.","PeriodicalId":439926,"journal":{"name":"2011 10th International Conference on Machine Learning and Applications and Workshops","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133024936","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
G. Nalbantov, A. Dekker, D. Ruysscher, P. Lambin, E. Smirnov
The radiation dose that can be delivered to the tumor in non-small cell lung cancer (NSCLC) patients is limited by negative side effects on normal tissues. The most dose-limiting factor in radiotherapy is radiation-induced lung toxicity (RILT). RILT is generally measured semi-quantitatively by a dyspnea (shortness-of-breath) score. In general, about 20-30% of patients develop RILT several months after treatment, and in about 70% of patients the delivered dose is insufficient to control tumor growth. Ideally, if the RILT score were known in advance, the dose treatment plan for low-toxicity-risk patients could be adjusted so that a higher dose is delivered to the tumor to better control it. A number of possible predictors of RILT have been proposed in the literature, including dose-related and clinical/demographic patient characteristics available prior to radiotherapy. In addition, the use of imaging features, which are noninvasive in nature, has been gaining momentum. Anatomic and functional/metabolic information from CT and PET scanner images, respectively, is already used in daily clinical practice and provides further information about the status of a patient. In this study we assessed whether machine learning techniques can successfully be applied to predict post-radiation lung damage, proxied by the dyspnea score, based on clinical, dose-related (dosimetric) and image features. Our dataset included 78 NSCLC patients, divided into two groups: patients with no deterioration of dyspnea and patients with deterioration of dyspnea. Several machine-learning binary classifiers were applied to discriminate between the two groups. The results, evaluated using the area under the ROC curve in a cross-validation procedure, are highly promising. This outcome could open the possibility of delivering better, individualized dose-treatment plans for lung cancer patients and could help the overall clinical decision-making (treatment) process.
{"title":"The Combination of Clinical, Dose-Related and Imaging Features Helps Predict Radiation-Induced Normal-Tissue Toxicity in Lung-cancer Patients -- An in-silico Trial Using Machine Learning Techniques","authors":"G. Nalbantov, A. Dekker, D. Ruysscher, P. Lambin, E. Smirnov","doi":"10.1109/ICMLA.2011.139","DOIUrl":"https://doi.org/10.1109/ICMLA.2011.139","url":null,"abstract":"The amount of delivered radiation dose to the tumor in non-small cell lung cancer (NSCLC) patients is limited by the negative side effects on normal tissues. The most dose-limiting factor in radiotherapy is the radiation-induced lung toxicity (RILT). RILT is generally measured semi-quantitatively, by a dyspnea, or shortness-of-breath, score. In general, about 20-30% of patients develop RILT several months after treatment, and in about 70% of the patients the delivered dose is insufficient to control the tumor growth. Ideally, if the RILT score would be known in advance, then the dose treatment plan for the low-toxicity-risk patients could be adjusted so that higher dose is delivered to the tumor to better control it. A number of possible predictors of RILT have been proposed in the literature, including dose-related and clinical/demographic patient characteristics available prior to radiotherapy. In addition, the use of imaging features -- which are noninvasive in nature - has been gaining momentum. Thus, anatomic as well as functional/metabolic information from CT and PET scanner images respectively are used in daily clinical practice, which provide further information about the status of a patient. In this study we assessed whether machine learning techniques can successfully be applied to predict post-radiation lung damage, proxied by dyspnea score, based on clinical, dose-related (dosimetric) and image features. Our dataset included 78 NSCLC patients. The patients were divided into two groups: no-deterioration-of-dyspnea, and deterioration-of-dyspnea patients. Several machine-learning binary classifiers were applied to discriminate the two groups. The results, evaluated using the area under the ROC curve in a cross-validation procedure, are highly promising. This outcome could open the possibility to deliver better, individualized dose-treatment plans for lung cancer patients and help the overall clinical decision making (treatment) process.","PeriodicalId":439926,"journal":{"name":"2011 10th International Conference on Machine Learning and Applications and Workshops","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133517809","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ş. Sağiroğlu, H. Kahraman, M. Yesilbudak, I. Colak
The driving frequency, amplitude, and phase difference of the two-phase sinusoidal voltages are the input parameters that influence the speed stability of travelling wave ultrasonic motors (TWUSMs). These parameters are also time-varying due to variations in operating temperature. In addition, a complete mathematical model of the TWUSM has not yet been derived. For these reasons, a machine learning approach is required to determine the compatibility of the operating parameters related to the speed stability of TWUSMs. For this purpose, an intelligent decision support tool for TWUSMs has been designed in this study. Input parameters such as the driving frequency, amplitude, and phase difference of the two-phase sinusoidal voltages, together with the operating temperature, were evaluated by the k-nearest neighbor algorithm in the decision support tool. The results show that the proposed tool is effective in determining the compatibility of operating parameters related to the speed stability of TWUSMs.
{"title":"An Intelligent Decision Support Tool for a Travelling Wave Ultrasonic Motor Based on k-Nearest Neighbor Algorithm","authors":"Ş. Sağiroğlu, H. Kahraman, M. Yesilbudak, I. Colak","doi":"10.1109/ICMLA.2011.33","DOIUrl":"https://doi.org/10.1109/ICMLA.2011.33","url":null,"abstract":"Driving frequency, amplitude and phase difference of two-phase sinusoidal voltages are the input parameters which have influence on speed stability of travelling wave ultrasonic motors (TWUSMs).These parameters are also time-varying due to the variations in operating temperature. In addition, a complete mathematical model of the TWUSM has not been derived yet. Owing to these reasons, a machine learning approach is required for determining the compatibility of operating parameters related to speed stability of TWUSMs. For this purpose, an intelligent decision support tool has been designed for TWUSMs in this study. The input parameters such as driving frequency, amplitude, phase difference of two-phase sinusoidal voltages and operating temperature were evaluated by the k-nearest neighbor algorithm in the decision support tool. The results have shown that the proposed tool provides effective results in the compatibility determination of operating parameters related to speed stability of TWUSMs.","PeriodicalId":439926,"journal":{"name":"2011 10th International Conference on Machine Learning and Applications and Workshops","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133758585","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
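The k-nearest-neighbor vote at the core of such a decision support tool is a short function; the feature layout and the "stable"/"unstable" labels below are hypothetical stand-ins for the paper's frequency/amplitude/phase/temperature inputs:

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Classify query point x by a majority vote among its k nearest
    training points under Euclidean distance."""
    dists = sorted(
        (math.dist(row, x), label) for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]
```

In practice the four motor features would be normalized to comparable scales before the distance computation, since frequency and temperature live in very different units.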
Essential genes constitute the minimal set of genes an organism needs for its survival. Identification of essential genes is of theoretical interest to genome biologists and has practical applications in medicine and biotechnology. This paper presents and evaluates machine learning approaches to the problem of predicting essential genes in microbial genomes using solely sequence-derived input features. We investigate three supervised classification methods -- Support Vector Machine (SVM), Artificial Neural Network (ANN), and Decision Tree (DT) -- for this binary classification task. The classifiers are trained and evaluated using 37,830 examples obtained from 14 experimentally validated, taxonomically diverse microbial genomes whose essential genes are known. A set of 52 relevant genomic sequence-derived features is used as input for the classifiers. The models were evaluated using the novel blind testing schemes Leave-One-Genome-Out (LOGO) and Leave-One-Taxon-group-Out (LOTO), as well as a 10-fold stratified cross-validation (10-f-cv) strategy, on both the full multi-genome dataset and its class-imbalance-reduced variants. Experimental results (10 x 10-f-cv) indicate that SVM and ANN perform better than DT, with area under the receiver operating characteristic curve (AU-ROC) scores of 0.80, 0.79, and 0.68, respectively. This study demonstrates that supervised machine learning methods can be used to predict essential genes in microbial genomes using only the gene sequence and features derived from it. The LOGO and LOTO blind-test results suggest that the trained classifiers generalize across genomes and taxonomic boundaries.
{"title":"Predicting \"Essential\" Genes across Microbial Genomes: A Machine Learning Approach","authors":"Krishna Palaniappan, Sumitra Mukherjee","doi":"10.1109/ICMLA.2011.114","DOIUrl":"https://doi.org/10.1109/ICMLA.2011.114","url":null,"abstract":"Essential genes constitute the minimal set of genes an organism needs for its survival. Identification of essential genes is of theoretical interest to genome biologist and has practical applications in medicine and biotechnology. This paper presents and evaluates machine learning approaches to the problem of predicting essential genes in microbial genomes using solely sequence derived input features. We investigate three different supervised classification methods -- Support Vector Machine (SVM), Artificial Neural Network (ANN), and Decision Tree (DT) -- for this binary classification task. The classifiers are trained and evaluated using 37830 examples obtained from 14 experimentally validated, taxonomically diverse microbial genomes whose essential genes are known. A set of 52 relevant genomic sequence derived features is used as input for the classifiers. The models were evaluated using novel blind testing schemes Leave-One-Genome-Out (LOGO) and Leave-One-Taxon-group-Out (LOTO) and 10-fold stratified cross validation (10-f-cv) strategy on both the full multi-genome datasets and its class imbalance reduced variants. Experimental results (10 X 10-f-cv) indicate SVM and ANN perform better than DT with Area under the Receiver Operating Characteristics (AU-ROC) scores of 0.80, 0.79 and 0.68 respectively. This study demonstrates that supervised machine learning methods can be used to predict essential genes in microbial genomes by using only gene sequence and features derived from it. LOGO and LOTO Blind test results suggest that the trained classifiers generalize across genomes and taxonomic boundaries.","PeriodicalId":439926,"journal":{"name":"2011 10th International Conference on Machine Learning and Applications and Workshops","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124198689","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
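The AU-ROC metric used above to compare SVM, ANN, and DT can be computed directly from classifier scores via the rank-sum (Mann-Whitney) identity: it equals the probability that a randomly chosen positive example outscores a randomly chosen negative one. This helper is a generic sketch of that identity, not the paper's evaluation code:

```python
def auc_roc(labels, scores):
    """Area under the ROC curve: the fraction of (positive, negative)
    pairs ranked correctly, counting ties as half a win."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise formulation makes AU-ROC insensitive to class imbalance, which is why it suits essential-gene datasets where positives are rare.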
This study introduces an intelligent power factor correction approach based on Linear Regression (LR) and Ridge Regression (RR) methods. A 10-fold Cross Validation (CV) test protocol was used to evaluate performance. LR achieved better test performance than RR. The empirical results indicate that the intelligent compensators developed in this work may overcome problems reported in the literature, providing an accurate, simple, and low-cost solution for compensation.
{"title":"An Intelligent Power Factor Correction Approach Based on Linear Regression and Ridge Regression Methods","authors":"R. Bayindir, Murat Gök, E. Kabalci, O. Kaplan","doi":"10.1109/ICMLA.2011.34","DOIUrl":"https://doi.org/10.1109/ICMLA.2011.34","url":null,"abstract":"This study introduces an intelligent power factor correction approach based on Linear Regression (LR) and Ridge Regression (RR) methods. The 10-fold Cross Validation (CV) test protocol has been used to evaluate the performance. The best test performance has been obtained from the LR in comparison with RR. The empirical results have evaluated that the selected intelligent compensators developed in this work might overcome the problems met in the literature providing accurate, simple and low-cost solution for compensation.","PeriodicalId":439926,"journal":{"name":"2011 10th International Conference on Machine Learning and Applications and Workshops","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116930211","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
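The relationship between LR and RR is easiest to see in the one-feature closed form, where the ridge penalty simply shrinks the slope toward zero; this is a generic sketch of the two methods, not the paper's compensator model:

```python
def ridge_fit(xs, ys, lam=0.0):
    """One-feature ridge regression in closed form on centered data.
    lam=0.0 reduces to ordinary linear regression (LR); lam>0 gives
    ridge regression (RR). Returns (slope, intercept)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / (sxx + lam)       # the penalty lam shrinks the slope
    intercept = my - slope * mx
    return slope, intercept
```

With noisy, correlated electrical measurements, the shrinkage trades a little bias for lower variance; the study's CV comparison is exactly the test of whether that trade pays off.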
In computer forensic analysis, hundreds of thousands of files are usually examined. Many of those files consist of unstructured text, which is difficult for computer examiners to analyze. In this context, automated methods of analysis are of great interest. In particular, algorithms for clustering documents can facilitate the discovery of new and useful knowledge from the documents under analysis. We present an approach that applies clustering algorithms to the forensic analysis of computers seized in police investigations. We illustrate the proposed approach with experiments using five clustering algorithms (K-means, K-medoids, Single Link, Complete Link, and Average Link) applied to five datasets obtained from computers seized in real-world investigations. In addition, two relative validity indexes were used to automatically estimate the number of clusters. Related studies in the literature are significantly more limited than our study. Our experiments show that the Average Link and Complete Link algorithms provide the best results for our application domain. If suitably initialized, the partitional algorithms (K-means and K-medoids) can also yield very good results. Finally, we present and discuss practical results that can be useful for researchers and practitioners of forensic computing.
{"title":"Document Clustering for Forensic Computing: An Approach for Improving Computer Inspection","authors":"Luís Filipe da Cruz Nassif, Eduardo R. Hruschka","doi":"10.1109/ICMLA.2011.59","DOIUrl":"https://doi.org/10.1109/ICMLA.2011.59","url":null,"abstract":"In computer forensic analysis, hundreds of thousands of files are usually examined. Much of those files consist of unstructured text, whose analysis by computer examiners is difficult to be performed. In this context, automated methods of analysis are of great interest. In particular, algorithms for clustering documents can facilitate the discovery of new and useful knowledge from the documents under analysis. We present an approach that applies clustering algorithms to forensic analysis of computers seized in police investigations. We illustrate the proposed approach by carrying out experimentation with five clustering algorithms (K-means, K-medoids, Single Link, Complete Link, and Average Link) applied to five datasets obtained from computers seized in real-world investigations. In addition, two relative validity indexes were used to automatically estimate the number of clusters. Related studies in the literature are significantly more limited than our study. Our experiments show that the Average Link and Complete Link algorithms provide the best results for our application domain. If suitably initialized, partitional algorithms (K-means and K-medoids) can also yield to very good results. Finally, we also present and discuss practical results that can be useful for researchers and practitioners of forensic computing.","PeriodicalId":439926,"journal":{"name":"2011 10th International Conference on Machine Learning and Applications and Workshops","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129049457","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
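A relative validity index of the kind used above to estimate the number of clusters can be illustrated with the mean silhouette width, computed per candidate partition and maximized over the number of clusters; silhouette is one common choice, not necessarily one of the two indexes the paper used:

```python
import math

def silhouette(points, labels):
    """Mean silhouette width of a partition: for each point, compare its
    mean intra-cluster distance a to the mean distance b to the nearest
    other cluster; (b - a) / max(a, b) is near 1 for compact, well
    separated clusters. Requires at least two clusters."""
    n = len(points)
    total = 0.0
    for i in range(n):
        same = [math.dist(points[i], points[j]) for j in range(n)
                if j != i and labels[j] == labels[i]]
        a = sum(same) / len(same) if same else 0.0
        b = min(
            sum(math.dist(points[i], points[j])
                for j in range(n) if labels[j] == c)
            / sum(1 for j in range(n) if labels[j] == c)
            for c in set(labels) if c != labels[i]
        )
        total += (b - a) / max(a, b)
    return total / n
```

Running a clustering algorithm for several candidate cluster counts and keeping the partition with the highest index automates the choice of k, which matters in forensics where the examiner cannot guess the number of topics in advance.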
The problem of detecting clusters in high-dimensional data is increasingly common in machine learning applications, for instance in computer vision and bioinformatics. Recently, a number of approaches in the field of subspace clustering have been proposed which search for clusters in subspaces of unknown dimensions. Learning the number of clusters, the dimension of each subspace, and the correct assignments is a challenging task, and many existing algorithms perform poorly in the presence of subspaces that have different dimensions and possibly overlap, or are otherwise computationally expensive. In this work we present a novel approach to subspace clustering that learns the number of clusters and the dimensionality of each subspace in an efficient way. We assume that the data points in each cluster are well represented in low dimensions by a PCA model. We propose a measure of the predictive influence of data points modelled by PCA, which we minimise to drive the clustering process. The proposed predictive subspace clustering algorithm is assessed on both simulated data and the popular Yale faces database, where state-of-the-art performance and speed are obtained.
{"title":"Predictive Subspace Clustering","authors":"B. McWilliams, G. Montana","doi":"10.1109/ICMLA.2011.117","DOIUrl":"https://doi.org/10.1109/ICMLA.2011.117","url":null,"abstract":"The problem of detecting clusters in high-dimensional data is increasingly common in machine learning applications, for instance in computer vision and bioinformatics. Recently, a number of approaches in the field of subspace clustering have been proposed which search for clusters in subspaces of unknown dimensions. Learning the number of clusters, the dimension of each subspace, and the correct assignments is a challenging task, and many existing algorithms often perform poorly in the presence of subspaces that have different dimensions and possibly overlap, or are otherwise computationally expensive. In this work we present a novel approach to subspace clustering that learns the numbers of clusters and the dimensionality of each subspace in an efficient way. We assume that the data points in each cluster are well represented in low-dimensions by a PCA model. We propose a measure of predictive influence of data points modelled by PCA which we minimise to drive the clustering process. The proposed predictive subspace clustering algorithm is assessed on both simulated data and on the popular Yale faces database where state-of-the-art performance and speed are obtained.","PeriodicalId":439926,"journal":{"name":"2011 10th International Conference on Machine Learning and Applications and Workshops","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130791973","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
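The assignment step that PCA-model-based subspace clustering alternates with refitting the models can be sketched for the simplest case: 1-D subspaces through the origin in 2-D, where each point goes to the subspace with the smallest reconstruction error. The fixed candidate directions below are an illustrative assumption (the paper learns the models, their dimensions, and their number):

```python
import math

def assign_to_subspaces(points, directions):
    """Assign each 2-D point to the 1-D subspace (unit direction through
    the origin) with the smallest reconstruction error, i.e. the smallest
    orthogonal residual after projecting onto the direction."""
    labels = []
    for x, y in points:
        errs = []
        for dx, dy in directions:
            t = x * dx + y * dy               # projection coefficient
            rx, ry = x - t * dx, y - t * dy   # residual off the subspace
            errs.append(math.hypot(rx, ry))
        labels.append(min(range(len(directions)), key=errs.__getitem__))
    return labels
```

In the full algorithm, each direction would be replaced by a PCA model of learned dimension, and the residual by the predictive-influence measure the authors minimise.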