Document Classification through Building Specified N-Gram

2012 Sixth International Conference on Innovative Mobile and Internet Services in Ubiquitous Computing. Pub Date: 2012-07-04. DOI: 10.1109/IMIS.2012.142

Byeongkyu Ko, Dongjin Choi, Chang Choi, Junho Choi, Pankoo Kim

Citations: 2

Abstract

Document Classification through Building Specified N-Gram

This paper proposes a method for classifying textual documents using a specified n-gram data set. Web documents have great potential, and the amount of valuable information on the web has grown consistently over the years. However, the sheer size of the web makes it increasingly difficult to find the documents relevant to what users want. Many approaches have been suggested to overcome this obstacle, and the most important task among them is classifying textual documents into predefined categories. Although many statistical approaches have been introduced over the years, none has yet provided a perfect solution. In this paper, we suggest a method for textual document classification using an n-gram model. N-gram frequency data has great potential for finding similarities between documents. For this reason, we construct our own n-gram data sets from research papers. When an unknown document arrives, the system extracts n-grams from it. These n-grams are then compared with the n-grams in the previously constructed data sets using the proposed similarity measurement. The precision of this method reaches 86%.
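The pipeline described in the abstract (build per-category n-gram data sets, extract n-grams from an unknown document, compare the two) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not specify the proposed similarity measurement, so cosine similarity over n-gram frequencies is used here as a stand-in, and the choice of word bigrams, the tokenizer, the toy training texts, and all function names are assumptions made for this example.

```python
from collections import Counter
import math
import re


def extract_ngrams(text, n=2):
    """Return a Counter of word n-grams (as tuples) found in text.

    Assumption: lowercase alphanumeric word tokens and n=2; the abstract
    does not state the tokenization or the n-gram order used in the paper.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def build_category_profiles(labeled_docs, n=2):
    """Merge the n-gram counts of all training documents per category."""
    profiles = {}
    for category, text in labeled_docs:
        profiles.setdefault(category, Counter()).update(extract_ngrams(text, n))
    return profiles


def cosine_similarity(a, b):
    """Cosine similarity between two n-gram frequency Counters.

    Stand-in for the paper's proposed similarity measurement, which the
    abstract does not define.
    """
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(text, profiles, n=2):
    """Return the best-scoring category and the score for every category."""
    doc = extract_ngrams(text, n)
    scores = {cat: cosine_similarity(doc, prof) for cat, prof in profiles.items()}
    return max(scores, key=scores.get), scores


# Toy training data (hypothetical, for illustration only).
training = [
    ("networks", "wireless sensor networks route packets over wireless sensor nodes"),
    ("retrieval", "search engines rank web documents for user queries"),
]
profiles = build_category_profiles(training)
label, scores = classify("a routing protocol for wireless sensor networks", profiles)
print(label, scores)
```

In this toy run the unknown document shares the bigrams "wireless sensor" and "sensor networks" with the first category and none with the second, so it is assigned to the first. A real data set built from research papers would be far larger, and the similarity function would be replaced by whatever measure the paper actually defines.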
