Data mining using Probabilistic Grammars
Aljoharah Algwaiz, S. Rajasekaran, R. Ammar
2016 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), 2016-12-01
DOI: 10.1109/ISSPIT.2016.7886057 (https://doi.org/10.1109/ISSPIT.2016.7886057)
Citations: 2
Abstract
Efficient and accurate data mining has become vital as technological advances in data collection and storage soar. Researchers have proposed various valuable machine learning algorithms for data mining. However, few have utilized formal methods. This paper proposes a data mining approach using Probabilistic Context-Free Grammars (PCFGs). In this work, we employ PCFGs to mine large heterogeneous datasets. The data mining problem of interest is classification. To start with, a probabilistic grammar is inferred from datasets for which classifications are known. The learnt model can then be used to classify unknown data. Specifically, for each unknown data point, the model can be used to calculate the probabilities that this point belongs to each of the possible classes. A simple resolution strategy is to associate the point with the class that has the maximum probability. To demonstrate the applicability of our approach, we consider the problem of identifying splice junctions. Using a PCFG, an input DNA sequence is classified as donor, acceptor, or neither.
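The maximum-probability resolution strategy described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes one PCFG per class, each in Chomsky normal form, and scores a sequence with the standard inside algorithm. The names `inside_probability`, `toy_grammar`, and `classify` are hypothetical, and the rule weights are invented for illustration rather than inferred from labelled data as the paper describes.

```python
from collections import defaultdict


def inside_probability(seq, lexical, binary, start="S"):
    """Probability that a CNF PCFG generates seq (inside algorithm).

    lexical maps (lhs, terminal) -> probability of the rule lhs -> terminal.
    binary maps (lhs, left, right) -> probability of the rule lhs -> left right.
    """
    n = len(seq)
    if n == 0:
        return 0.0
    # chart[i][j][A] = probability that nonterminal A derives seq[i..j]
    chart = [[defaultdict(float) for _ in range(n)] for _ in range(n)]
    for i, symbol in enumerate(seq):
        for (lhs, terminal), p in lexical.items():
            if terminal == symbol:
                chart[i][i][lhs] += p
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):  # split point between left and right child
                left, right = chart[i][k], chart[k + 1][j]
                for (lhs, b, c), p in binary.items():
                    if b in left and c in right:
                        chart[i][j][lhs] += p * left[b] * right[c]
    return chart[0][n - 1].get(start, 0.0)


def toy_grammar(emit, p_continue=0.5):
    """Hypothetical grammar: S -> N S | base, N -> base (weights are made up)."""
    lexical = {}
    for base, p in emit.items():
        lexical[("N", base)] = p
        lexical[("S", base)] = (1.0 - p_continue) * p
    binary = {("S", "N", "S"): p_continue}
    return lexical, binary


def classify(seq, models):
    """Score seq under every class grammar and return the argmax label."""
    scores = {label: inside_probability(seq, lex, bin_rules)
              for label, (lex, bin_rules) in models.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Invented emission weights; a real model would be learned from labelled data.
models = {
    "donor": toy_grammar({"A": 0.1, "C": 0.1, "G": 0.4, "T": 0.4}),
    "acceptor": toy_grammar({"A": 0.4, "C": 0.1, "G": 0.4, "T": 0.1}),
    "neither": toy_grammar({"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}),
}

label, scores = classify("GTGT", models)
print(label, scores)
```

The chart fill is cubic in the sequence length, so this sketch suits short windows around a candidate site. If class priors were available, each score could be multiplied by its prior before taking the maximum.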