Tibetan word segmentation method based on CNN-BiLSTM-CRF model

Lili Wang, Hongwu Yang, Xiaotian Xing, Yajing Yan
Published in: 2019 International Conference on Asian Language Processing (IALP)
DOI: 10.1109/IALP48816.2019.9037661
Citations: 2

Abstract

We propose a Tibetan word segmentation method based on a CNN-BiLSTM-CRF model that takes only the characters of a sentence as input, so it requires neither large-scale corpus resources nor hand-crafted features for training. First, a convolutional neural network is used to train character vectors. The character vectors are retrieved from a character lookup table and stacked to form a matrix C. Convolving C with multiple filter matrices and then applying max pooling yields the character-level features of each Tibetan word. These features are fed through a highway network into a BiLSTM-CRF model suited to Tibetan word segmentation, producing a segmentation model that is optimized using the character vectors and the CRF model. For Tibetan, a morphologically rich language, the model has fewer parameters and trains faster than the BiLSTM-CRF model, and it performs better at the character level. The experimental results show that character input is sufficient for language modeling. The model improves the robustness of Tibetan word segmentation and achieves an F-score of 95.17%.
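The character-level feature extraction described in the abstract (lookup table, matrix C, convolution with multiple filters, max pooling, highway network) can be sketched roughly as follows. This is only an illustrative NumPy sketch, not the authors' implementation: the vocabulary size, embedding dimension, filter widths, random weights, and the tanh and ReLU activations are all assumptions, since the abstract does not give them. The highway layer uses the standard formulation t * g(W_h y + b_h) + (1 - t) * y. The BiLSTM-CRF stage is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration (not taken from the paper).
VOCAB_SIZE = 50            # number of distinct characters
EMBED_DIM = 8              # dimension of each character vector
FILTER_WIDTHS = (1, 2, 3)  # widths of the convolution filters
FILTERS_PER_WIDTH = 4
FEATURE_DIM = FILTERS_PER_WIDTH * len(FILTER_WIDTHS)

# Character lookup table: one row (character vector) per character id.
lookup = rng.normal(size=(VOCAB_SIZE, EMBED_DIM))

# One bank of filter matrices per width, shape (n_filters, width, EMBED_DIM).
filters = [rng.normal(size=(FILTERS_PER_WIDTH, w, EMBED_DIM))
           for w in FILTER_WIDTHS]

# Highway layer parameters (randomly initialised here; learned in practice).
W_t = rng.normal(size=(FEATURE_DIM, FEATURE_DIM))
b_t = np.full(FEATURE_DIM, -2.0)  # negative bias favours carrying the input
W_h = rng.normal(size=(FEATURE_DIM, FEATURE_DIM))
b_h = np.zeros(FEATURE_DIM)


def char_cnn_features(char_ids):
    """Stack character vectors into matrix C, convolve, then max-pool."""
    C = lookup[np.asarray(char_ids)]  # shape (length, EMBED_DIM)
    widest = max(FILTER_WIDTHS)
    if C.shape[0] < widest:  # zero-pad words shorter than the widest filter
        pad = np.zeros((widest - C.shape[0], EMBED_DIM))
        C = np.vstack([C, pad])
    pooled = []
    for bank in filters:
        width = bank.shape[1]
        # All contiguous windows of `width` characters.
        windows = np.stack([C[i:i + width]
                            for i in range(C.shape[0] - width + 1)])
        # Inner product of every window with every filter, then tanh.
        conv = np.tanh(np.einsum("nwd,fwd->nf", windows, bank))
        pooled.append(conv.max(axis=0))  # max pooling over positions
    return np.concatenate(pooled)  # shape (FEATURE_DIM,)


def highway(y):
    """Standard highway layer: gate between a transform and the raw input."""
    t = 1.0 / (1.0 + np.exp(-(W_t @ y + b_t)))  # transform gate in (0, 1)
    h = np.maximum(0.0, W_h @ y + b_h)          # ReLU transform
    return t * h + (1.0 - t) * y


# Hypothetical character ids for one word.
features = highway(char_cnn_features([3, 17, 42, 5]))
print(features.shape)
```

In a full model, the vector returned by `highway` for each word would be the input at one time step of the BiLSTM, whose outputs are then scored by the CRF layer.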