BERTCWS: unsupervised multi-granular Chinese word segmentation based on a BERT method for the geoscience domain

Annals of GIS (IF 2.7, JCR Q1, Geography) · Pub Date: 2023-03-02 · DOI: 10.1080/19475683.2023.2186487
Qinjun Qiu, Zhong Xie, K. Ma, Miao Tian
Journal Article. Annals of GIS, volume listed as "16 1", pages 387-399. Not open access. https://doi.org/10.1080/19475683.2023.2186487
Citations: 0

Abstract

Unlike alphabet-based languages such as English, Chinese has no explicit word boundaries. Word segmentation is therefore a fundamental step in Chinese text processing, information retrieval, and knowledge discovery. In the geoscience domain, most existing Chinese word segmentation tools and models require a prespecified dictionary and a large relevant training corpus, and their accuracy drops significantly on out-of-domain text. To address this issue, a purely unsupervised and generic two-stage architecture (named BERTCWS) for domain-specific Chinese word segmentation is proposed. We first design an incidence matrix, termed the ‘character combination tightness’, to calculate the closeness between characters. Then, BERTCWS recognizes geoscience terms using a segmenter based on Bidirectional Encoder Representations from Transformers (BERT), and multi-granular segmentation is generated by setting different thresholds. Finally, a discriminator is constructed to validate the correctness of the segmented words. Our numerical study demonstrates that BERTCWS can identify both general-domain and geoscience-domain terms. Additionally, multi-granular segmentation can offer a set of candidate geoscience terms of various lengths.
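The idea of producing segmentations at several granularities from a single set of character-closeness scores can be illustrated with a toy sketch. This is not the BERTCWS implementation: the abstract does not say how the tightness values are computed from BERT, so the scores below are made up by hand, and the function names `segment_by_tightness` and `multi_granular` are hypothetical. The sketch assumes only that a boundary is placed wherever the score between two adjacent characters falls below a chosen threshold, and it omits the discriminator stage entirely.

```python
from typing import Dict, List, Sequence


def segment_by_tightness(chars: str, tightness: Sequence[float],
                         threshold: float) -> List[str]:
    """Split `chars` wherever adjacent-character tightness is below `threshold`.

    tightness[i] is an assumed closeness score between chars[i] and chars[i + 1].
    """
    if len(tightness) != max(len(chars) - 1, 0):
        raise ValueError("need exactly one score per adjacent character pair")
    if not chars:
        return []
    words = [chars[0]]
    for ch, score in zip(chars[1:], tightness):
        if score >= threshold:
            words[-1] += ch   # tight enough: extend the current word
        else:
            words.append(ch)  # too loose: start a new word
    return words


def multi_granular(chars: str, tightness: Sequence[float],
                   thresholds: Sequence[float]) -> Dict[float, List[str]]:
    """Return one segmentation per threshold (higher threshold = finer split)."""
    return {t: segment_by_tightness(chars, tightness, t) for t in thresholds}


# Hand-made scores (illustrative only) for "花岗岩分布广泛",
# roughly "granite is widely distributed".
text = "花岗岩分布广泛"
scores = [0.9, 0.8, 0.2, 0.7, 0.3, 0.6]
result = multi_granular(text, scores, [0.25, 0.5, 0.75])
for threshold, words in result.items():
    print(threshold, words)
```

With these assumed scores, a low threshold keeps longer units together and a high threshold breaks the text into shorter ones, which is one way to read the abstract's statement that different thresholds yield candidate terms of various lengths.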
Source journal: Annals of GIS
CiteScore: 8.30
Self-citation rate: 2.00%
Articles published: 31
Latest articles in this journal:
Zero watermarking algorithm for BIM data based on distance partitioning and local feature
Controlling for spatial confounding and spatial interference in causal inference: modelling insights from a computational experiment
Application of GIS and fuzzy sets to small-scale site suitability assessment for extensive brackish water aquaculture
Revealing intra-urban hierarchical spatial structure through representation learning by combining road network abstraction model and taxi trajectory data
The time- and distance-decay effects of hurricane relevancy on social media: an empirical study of three hurricanes in the United States