A study in language identification

Australasian Document Computing Symposium. Pub Date: 2012-12-05. DOI: 10.1145/2407085.2407097

Rachel Mary Milne, Richard A. O'Keefe, A. Trotman

Citations: 14

Abstract

Language identification is the task of automatically determining the language in which a previously unseen document was written. We compared several prior methods on samples from the Wikipedia and EuroParl collections. Most of these methods work well. However, we find that these collections (and presumably other document collections) are heterogeneous in document size, and that short documents are systematically different from long ones: the techniques that work well on long documents are different from those that work well on short ones. We believe that algorithms will improve if document length is taken into account.
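One way to make the length effect visible is to report accuracy separately for short and long documents rather than as a single figure. The sketch below is illustrative only and is not taken from the paper: the function name `accuracy_by_length`, the character-length cutoffs, and the placeholder classifier `toy_identify` are all assumptions, and any real identifier could be passed in its place.

```python
from collections import defaultdict


def accuracy_by_length(samples, identify, boundaries=(50, 500)):
    """Return accuracy per length bucket.

    samples    -- iterable of (text, true_language) pairs
    identify   -- callable mapping a text to a predicted language code
    boundaries -- ascending character-length cutoffs (hypothetical values,
                  not thresholds reported in the paper)
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for text, language in samples:
        # Bucket index = number of cutoffs the text length meets or exceeds,
        # so bucket 0 holds the shortest documents.
        bucket = sum(len(text) >= b for b in boundaries)
        total[bucket] += 1
        correct[bucket] += identify(text) == language
    return {bucket: correct[bucket] / total[bucket] for bucket in sorted(total)}


def toy_identify(text):
    # Placeholder classifier for demonstration only (an assumption, not one
    # of the methods compared in the paper): guess English if the word
    # "the" occurs, otherwise German.
    return "en" if " the " in f" {text.lower()} " else "de"


if __name__ == "__main__":
    samples = [("the cat", "en"), ("Katze", "de"), ("cat", "en")]
    print(accuracy_by_length(samples, toy_identify, boundaries=(5,)))
```

A large gap between buckets would suggest that a single aggregate score hides weak behaviour on short texts, which is the kind of heterogeneity the abstract describes.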
