IME-Spell: Chinese Spelling Check based on Input Method

Qingbiao Zhao, Xingfa Shen, Jian Yao
Published in: Proceedings of the 4th International Conference on Natural Language Processing and Information Retrieval
Publication date: 2020-12-18
DOI: 10.1145/3443279.3443297 (https://doi.org/10.1145/3443279.3443297)
Citations: 1

Abstract

Chinese Spelling Check (CSC) has been investigated extensively in natural language processing as a way to reduce manual inspection costs and semantic misunderstandings. However, little work has been done on input-method-based CSC, in which Pinyin information can be used to improve spelling correction efficiency. This paper proposes IME-Spell, a novel CSC architecture for input methods based on pre-trained context vectors. It consists of two parts. The spelling detection part fuses character-based pre-trained context vectors with Pinyin vectors and uses sequence labeling to detect erroneous characters. The spelling correction part uses a Masked Language Model (MLM) to generate a candidate set for each erroneous character, then uses XGBoost and Pinyin-to-Character conversion models to filter the candidates for the correct characters and correct the errors for users. IME-Spell improves significantly over the benchmark models on the SIGHAN dataset: the maximum F1 differences in the spelling detection and correction subtasks reach 48.9% and 27.8%, respectively.
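
The two-stage flow described in the abstract (detect suspicious characters, propose candidates, then filter them using Pinyin) can be sketched roughly as follows. This is a toy illustration only, not the authors' implementation: the bigram lookup stands in for the sequence-labeling detector, the bigram scorer stands in for the Masked Language Model, and a plain Pinyin equality check stands in for the XGBoost and Pinyin-to-Character filtering. The names PINYIN, KNOWN_BIGRAMS, detect_errors, propose_candidates, and correct are all hypothetical.

```python
from typing import Dict, List, Tuple

# Toy character-to-Pinyin table (assumption; a real system needs a full lexicon).
PINYIN: Dict[str, str] = {
    "天": "tian", "气": "qi", "汽": "qi", "空": "kong", "很": "hen", "好": "hao",
}

# Toy set of plausible character bigrams (assumption; stands in for learned context).
KNOWN_BIGRAMS = {"天气", "天空", "气很", "很好"}


def detect_errors(sentence: str) -> List[int]:
    """Stand-in detector: flag a character when neither neighbouring bigram is known."""
    flagged = []
    for i in range(len(sentence)):
        left_ok = i == 0 or sentence[i - 1:i + 1] in KNOWN_BIGRAMS
        right_ok = i == len(sentence) - 1 or sentence[i:i + 2] in KNOWN_BIGRAMS
        if not left_ok and not right_ok:
            flagged.append(i)
    return flagged


def propose_candidates(sentence: str, index: int) -> List[Tuple[str, float]]:
    """Stand-in for the MLM: score each known character by how well it fits its neighbours."""
    left = sentence[index - 1] if index > 0 else ""
    right = sentence[index + 1] if index + 1 < len(sentence) else ""
    scored = []
    for ch in PINYIN:
        score = int((left + ch) in KNOWN_BIGRAMS) + int((ch + right) in KNOWN_BIGRAMS)
        if score > 0:
            scored.append((ch, float(score)))
    # Highest-scoring candidates first.
    return sorted(scored, key=lambda pair: -pair[1])


def correct(sentence: str) -> str:
    """Replace each flagged character with the best candidate sharing its Pinyin."""
    chars = list(sentence)
    for i in detect_errors(sentence):
        typed_pinyin = PINYIN.get(chars[i])
        # Stand-in filter: keep only candidates pronounced like what the user typed.
        kept = [c for c, _ in propose_candidates(sentence, i) if PINYIN[c] == typed_pinyin]
        if kept:
            chars[i] = kept[0]
    return "".join(chars)


print(correct("天汽很好"))  # 天气很好
```

In this sketch the Pinyin check is what discards 空 (kong) as a replacement for 汽 (qi), which mirrors the abstract's point that input-method information can narrow the candidate set.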