Unleashing the power of pinyin: promoting Chinese named entity recognition with multiple embedding and attention
Jigui Zhao, Yurong Qian, Shuxiang Hou, Jiayin Chen, Kui Wang, Min Liu, Aizimaiti Xiaokaiti
Complex & Intelligent Systems (JCR Q1, Computer Science, Artificial Intelligence; impact factor 5.0), published 2025-01-04
DOI: 10.1007/s40747-024-01753-0 (https://doi.org/10.1007/s40747-024-01753-0)
Citations: 0
Abstract
Named Entity Recognition (NER) aims to identify entities with specific meanings, together with their boundaries, in natural language text. Because Chinese and English belong to different language families, Chinese NER faces challenges such as ambiguous word boundaries and semantic diversity. Previous studies on Chinese NER have focused on character and lexical information, neglecting a feature unique to Chinese: pinyin. In this paper, we propose CPL-NER, which combines multiple features of Chinese characters into the embedding and enriches the semantic representation with pinyin and dictionary information. Pinyin information helps resolve polyphonic characters (characters with more than one pronunciation), while dictionary information helps resolve word segmentation ambiguities. We also design the Pinyin-Lexicon Cross-Attention Mechanism (PLCA), which calculates attention scores between the different embeddings. This mechanism deeply integrates character, pinyin, and lexicon embeddings, generating character sequences enriched with semantic information. Finally, BiLSTM-CRF is employed for sequence modeling. This design captures semantic features in Chinese text more comprehensively and improves the model's handling of polyphonic characters and word segmentation ambiguities, thereby improving Chinese NER performance. We conducted experiments on four standard Chinese NER benchmark datasets, and the results show that our method outperforms most baselines, demonstrating the effectiveness of the proposed model.
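The abstract does not give the equations behind PLCA, so the following is only a rough sketch of the general idea of cross-attention between embedding streams, not the authors' mechanism. It assumes plain scaled dot-product attention, uses random vectors in place of learned character, pinyin, and lexicon embeddings, and omits the BiLSTM-CRF stage entirely. All function names (`cross_attention`, `fuse`, `random_matrix`) and the dimensions are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query row attends over all key rows.

    Returns the attended context vectors and the attention weights.
    """
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        w = softmax(scores)
        # Weighted sum of the value rows, computed one dimension at a time.
        context = [sum(wj * v[c] for wj, v in zip(w, values))
                   for c in range(len(values[0]))]
        outputs.append(context)
        weights.append(w)
    return outputs, weights


def fuse(char_emb, pinyin_emb, lexicon_emb):
    """Assumed fusion: concatenate each character vector with the context it
    gathers from the pinyin stream and from the lexicon stream."""
    pinyin_ctx, _ = cross_attention(char_emb, pinyin_emb, pinyin_emb)
    lexicon_ctx, _ = cross_attention(char_emb, lexicon_emb, lexicon_emb)
    return [c + p + l for c, p, l in zip(char_emb, pinyin_ctx, lexicon_ctx)]


def random_matrix(rows, cols, rng):
    """Stand-in for learned embeddings: uniform random values in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim = 4
char_emb = random_matrix(5, dim, rng)     # 5 characters in the sentence
pinyin_emb = random_matrix(5, dim, rng)   # one pinyin vector per character
lexicon_emb = random_matrix(3, dim, rng)  # 3 lexicon words matched in the sentence

fused = fuse(char_emb, pinyin_emb, lexicon_emb)
print(len(fused), len(fused[0]))  # sequence length, fused feature size
```

In this sketch the characters act as queries, so the output keeps one vector per character, which is the shape a downstream sequence model such as BiLSTM-CRF would consume. Concatenation is just one possible way to combine the streams; the paper may use a different one.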
About the journal:
Complex & Intelligent Systems aims to provide a forum for presenting and discussing novel approaches, tools and techniques meant for attaining a cross-fertilization between the broad fields of complex systems, computational simulation, and intelligent analytics and visualization. The transdisciplinary research that the journal focuses on will expand the boundaries of our understanding by investigating the principles and processes that underlie many of the most profound problems facing society today.