Title: Automatic text categorization: Marathi documents
Authors: Jaydeep Jalindar Patil, N. Bogiri
Venue: 2015 International Conference on Energy Systems and Applications
Publication date: 2015-10-01
DOI: 10.1109/ICESA.2015.7503438
URL: https://doi.org/10.1109/ICESA.2015.7503438
Citations: 30
Abstract
Information technology has generated a huge volume of data on the internet. Initially this data was mainly in English, so the majority of data mining research has focused on English text documents. As internet usage increased, data in other languages such as Marathi, Tamil, Telugu, and Punjabi also grew. This paper presents a retrieval system for Marathi-language documents based on a user profile. The user profile captures the user's interests and browsing history, and the system shows Marathi documents to the end user based on that profile. Automatic text categorization supports better management and retrieval of these text documents and makes document retrieval a simple task. This paper discusses automatic text categorization of Marathi documents and surveys the related work in this area. Various learning techniques exist for classifying text documents, such as Naïve Bayes, Support Vector Machines, and Decision Trees. Different clustering techniques are also used for text categorization, such as the Label Induction Grouping Algorithm (LINGO), Suffix Tree Clustering, and K-means. The literature survey indicates that for non-English documents the Vector Space Model (VSM) gives better results than other models. The system categorizes Marathi documents using the LINGO algorithm, which is based on the VSM. The system uses a dataset of 200 documents in 20 different categories. The results indicate that the LINGO clustering algorithm is efficient for Marathi text documents.
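The Vector Space Model that the abstract refers to can be illustrated with a minimal sketch. This is not the paper's implementation and it does not reproduce LINGO itself; it only shows the underlying idea of representing documents as TF-IDF vectors and comparing them by cosine similarity. The function names, the whitespace tokenizer, the smoothed IDF variant, and the three toy Marathi documents are all assumptions made for illustration and are not taken from the paper's 200-document dataset.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (term -> weight dicts) for a list of texts.

    Assumption: whitespace tokenization, with no stemming or stop-word
    removal. A real Marathi pipeline would likely need both.
    """
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        if not tokens:
            vectors.append({})
            continue
        term_counts = Counter(tokens)
        vector = {}
        for term, count in term_counts.items():
            tf = count / len(tokens)
            # One common smoothed IDF variant (an assumption, not the paper's).
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
            vector[term] = tf * idf
        vectors.append(vector)
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy data (hypothetical): two sports-like documents and one finance-like one.
docs = [
    "क्रिकेट सामना संघ",
    "क्रिकेट संघ विजय",
    "शेअर बाजार गुंतवणूक",
]
vectors = tfidf_vectors(docs)
sports_pair = cosine_similarity(vectors[0], vectors[1])
mixed_pair = cosine_similarity(vectors[0], vectors[2])
```

In this toy example the two documents that share terms score a higher similarity than the pair with no terms in common. A VSM-based clustering method such as LINGO builds on the same kind of term-document representation.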