Wajid Ali, M. Kamran Malik, S. Hussain, S. Siddiq, A. Ali
Urdu noun phrase chunking: HMM based approach. 2010 International Conference on Educational and Information Technology, published 2010-10-25. DOI: 10.1109/ICEIT.2010.5607623 (https://doi.org/10.1109/ICEIT.2010.5607623)
Citations: 12
Abstract
Extracting noun phrases (NPs) from text is useful for many natural language processing applications, such as named entity recognition, indexing, searching, and parsing. We present a statistical noun phrase chunker for Urdu. A 100,000-word Urdu corpus is manually annotated with NP chunk tags and used to develop the approach. Initially, a chunker based on a standard hidden Markov model (HMM) is developed for automatic NP chunking. In Urdu, the case marker (CM) is appended at the end of a noun phrase and therefore signals where the phrase ends. Thus, scanning the sentence in reverse order may allow phrase endings to be predicted better, so the technique is enhanced by changing the scanning direction. It is further enhanced by merging chunk and POS tags into an extended tagset to improve accuracy. Results are reported for all experiments; the maximum overall accuracy of 97.61% is achieved by the HMM-based approach with the extended tagset and right-to-left (RTL) scanning.
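
The combination described in the abstract (an HMM, reversed scanning, and merged POS and chunk tags) can be illustrated with a minimal sketch. It is not the authors' implementation. The toy corpus, the B/I/O chunk labels, the choice of POS tags as observations, the add-one smoothing, and the names `train`, `log_prob`, and `viterbi_chunk` are all assumptions made for illustration.

```python
import math
from collections import defaultdict

# Hypothetical toy corpus (not from the paper): (POS tag, chunk tag) pairs.
# B = begins an NP, I = inside an NP, O = outside; CM = case marker closing an NP.
TRAIN = [
    [("NN", "B"), ("CM", "I"), ("ADJ", "B"), ("NN", "I"), ("VB", "O")],
    [("ADJ", "B"), ("NN", "I"), ("CM", "I"), ("NN", "B"), ("VB", "O")],
    [("NN", "B"), ("VB", "O")],
]

def train(sentences, merge_tags=True, reverse=True):
    """Count transitions and emissions; states are merged POS_chunk tags if requested."""
    trans = defaultdict(lambda: defaultdict(int))
    emit = defaultdict(lambda: defaultdict(int))
    for sent in sentences:
        seq = sent[::-1] if reverse else sent  # right-to-left scanning
        prev = "<s>"
        for pos, chunk in seq:
            state = f"{pos}_{chunk}" if merge_tags else chunk
            trans[prev][state] += 1
            emit[state][pos] += 1
            prev = state
    return trans, emit

def log_prob(counts, key, n_outcomes):
    """Add-one smoothed log probability (smoothing is an assumption of this sketch)."""
    total = sum(counts.values())
    return math.log((counts.get(key, 0) + 1) / (total + n_outcomes))

def viterbi_chunk(pos_tags, trans, emit, reverse=True):
    """Return one chunk tag per input POS tag, in the original word order."""
    obs = pos_tags[::-1] if reverse else list(pos_tags)
    states = sorted(emit)
    n_states = len(states)
    n_obs = len({p for s in emit for p in emit[s]})
    # Each column maps state -> (best log score, backpointer to previous state).
    best = [{s: (log_prob(trans["<s>"], s, n_states)
                 + log_prob(emit[s], obs[0], n_obs), None) for s in states}]
    for o in obs[1:]:
        col = {}
        for s in states:
            score, prev = max(
                (best[-1][p][0] + log_prob(trans[p], s, n_states), p) for p in states
            )
            col[s] = (score + log_prob(emit[s], o, n_obs), prev)
        best.append(col)
    # Follow backpointers from the last scanned position to the first.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for col in reversed(best[1:]):
        state = col[state][1]
        path.append(state)
    # path now runs from last scanned to first scanned. For reversed scanning
    # that is already the original word order; otherwise flip it.
    if not reverse:
        path.reverse()
    return [s.rsplit("_", 1)[-1] for s in path]  # strip the POS part of merged tags

trans, emit = train(TRAIN, merge_tags=True, reverse=True)
print(viterbi_chunk(["NN", "CM", "ADJ", "NN", "VB"], trans, emit, reverse=True))
```

Passing `merge_tags=False` or `reverse=False` to both `train` and `viterbi_chunk` gives the plain-tagset and left-to-right variants, which is one way the configurations mentioned in the abstract could be compared on a real annotated corpus.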