Tibetan word segmentation method based on CNN-BiLSTM-CRF model

2019 International Conference on Asian Language Processing (IALP) | Pub Date: 2018-11-01 | DOI: 10.1109/IALP48816.2019.9037661

Lili Wang, Hongwu Yang, Xiaotian Xing, Yajing Yan


Citations: 2

Abstract

We propose a Tibetan word segmentation method based on a CNN-BiLSTM-CRF model that takes only the characters of a sentence as input, so it requires neither large-scale corpus resources nor hand-crafted features for training. First, a convolutional neural network is used to train character vectors: the character vectors are retrieved from a character lookup table and stacked to form a matrix C. The matrix C is then convolved with multiple filter matrices, and max pooling over the results yields the character-level features of each Tibetan word. These character-level feature vectors are passed through a highway network into a BiLSTM-CRF model suited to Tibetan word segmentation, giving a segmentation model that is optimized using both the character vectors and the CRF model. For Tibetan, a morphologically rich language, fewer parameters and shorter training time give this model an advantage over the BiLSTM-CRF model at the character level. The experimental results show that character input is sufficient for language modeling. The model improves the robustness of Tibetan word segmentation and achieves an F value of 95.17%.
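The character-level pipeline described in the abstract (lookup table, matrix C, multiple filters, max pooling, highway network) can be sketched in NumPy as below. This is a minimal illustration under assumptions, not the authors' implementation: the function names `char_cnn_features` and `highway`, the tanh and ReLU nonlinearities, the zero padding for short inputs, and all sizes (vocabulary of 50, 8-dimensional vectors, filter widths 1 to 3) are hypothetical choices, and the weights are random rather than trained. The BiLSTM-CRF tagger that would consume the output is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def char_cnn_features(char_ids, lookup, filters):
    """Convolve the stacked character vectors with each filter and max-pool."""
    # Look up each character vector and stack them as columns: matrix C, shape (d, l).
    C = lookup[char_ids].T
    feats = []
    for H in filters:                       # each filter H has shape (d, w)
        w = H.shape[1]
        # Assumption: zero-pad on the right when the unit is shorter than the filter.
        if C.shape[1] < w:
            C_pad = np.pad(C, ((0, 0), (0, w - C.shape[1])))
        else:
            C_pad = C
        # Slide the filter along the character axis (narrow convolution, no bias).
        responses = [
            np.tanh(np.sum(C_pad[:, i:i + w] * H))
            for i in range(C_pad.shape[1] - w + 1)
        ]
        # Max pooling over positions keeps one feature per filter.
        feats.append(max(responses))
    return np.array(feats)

def highway(y, W_h, b_h, W_t, b_t):
    """One highway layer: a gated mix of a transformed input and the input itself."""
    t = 1.0 / (1.0 + np.exp(-(W_t @ y + b_t)))   # transform gate in (0, 1)
    g = np.maximum(0.0, W_h @ y + b_h)           # candidate transform (ReLU)
    return t * g + (1.0 - t) * y                 # carry gate is 1 - t

# Hypothetical sizes, chosen only for illustration.
vocab_size, d = 50, 8
lookup = rng.normal(size=(vocab_size, d))        # character lookup table
filters = [rng.normal(size=(d, w)) for w in (1, 2, 3) for _ in range(4)]
n = len(filters)                                 # 12 filters -> 12 features
W_h, W_t = rng.normal(size=(n, n)), rng.normal(size=(n, n))
b_h, b_t = np.zeros(n), np.full(n, -2.0)         # negative gate bias favors carrying

unit = np.array([3, 17, 42, 5])                  # character ids of one unit
y = char_cnn_features(unit, lookup, filters)
z = highway(y, W_h, b_h, W_t, b_t)
print(y.shape, z.shape)
```

In a full model of this kind, one such feature vector per unit would form the input sequence to the BiLSTM, with the CRF layer scoring the resulting tag sequence.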

