Bilingual Auto-Categorization Comparison of Two LSTM Text Classifiers

2019 8th International Congress on Advanced Applied Informatics (IIAI-AAI). Pub Date: 2019-07-01. DOI: 10.1109/IIAI-AAI.2019.00127

Johannes Lindén, Xutao Wang, Stefan Forsström, Tingting Zhang


Abstract

Multilingual problems such as auto-categorization are not easy tasks. One option is to train a separate model for each language; another is to build the model in one base language and automatically translate text from other languages into that base language. Each language is biased toward its own grammar and syntax, so content written in one language can be difficult to express in another. Translating from a language into a non-verbal language could potentially have a positive impact on the categorization results. A non-verbal language could, for example, be pure information in the form of knowledge-graph relations extracted from the text. In this article, a comparison is conducted between Chinese and Swedish. Two categorization models are developed and validated on each dataset. The purpose is to build an auto-categorization model that works for any language. One model is built on LSTM and optimized for Swedish; the other is an improved Bidirectional-LSTM Convolution model optimized for Chinese. The improved algorithm is trained on both languages and compared with the LSTM algorithm. The Bidirectional-LSTM algorithm performs approximately 20 percentage points better than the LSTM algorithm, which is significant.
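The reported gain is stated in percentage units, which reads most naturally as an absolute difference in percentage points rather than a relative improvement. The sketch below shows how the two measures differ. The accuracy values and function names are hypothetical, chosen only for illustration; they are not results from the paper.

```python
def percentage_point_gain(baseline_acc: float, improved_acc: float) -> float:
    """Absolute difference between two accuracies, in percentage points."""
    return (improved_acc - baseline_acc) * 100.0


def relative_gain(baseline_acc: float, improved_acc: float) -> float:
    """Improvement relative to the baseline accuracy, in percent."""
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (improved_acc - baseline_acc) / baseline_acc * 100.0


# Hypothetical accuracies, for illustration only (NOT taken from the paper).
lstm_acc = 0.60
bilstm_conv_acc = 0.80

print(round(percentage_point_gain(lstm_acc, bilstm_conv_acc), 1))  # 20.0
print(round(relative_gain(lstm_acc, bilstm_conv_acc), 1))          # 33.3
```

Under these assumed values, the same 20-point gap corresponds to a relative improvement of about 33%, which is why the unit matters when comparing classifiers.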
