Automatic Keyphrase Extraction from Scientific Chinese Medical Abstracts Based on Character-Level Sequence Labeling

Liangping Ding, Zhixiong Zhang, Huan Liu, Jie Li, Gaihong Yu
Journal of Data and Information Science, Vol. 6, No. 1, pp. 35–57. Published 2021-03-02. DOI: 10.2478/jdis-2021-0013. Cited by: 7.

Abstract

Purpose: Automatic keyphrase extraction (AKE) is an important task for grasping the main points of a text. In this paper, we combine the benefits of the sequence labeling formulation and pretrained language models to propose an automatic keyphrase extraction model for Chinese scientific abstracts.

Design/methodology/approach: We treat AKE from Chinese text as a character-level sequence labeling task to avoid the segmentation errors of Chinese tokenizers, and initialize our model with the pretrained language model BERT, released by Google in 2018. We collect data from the Chinese Science Citation Database and construct a large-scale dataset from the medical domain, containing 100,000 abstracts as the training set, 6,000 abstracts as the development set, and 3,094 abstracts as the test set. As baselines, we use unsupervised keyphrase extraction methods, including term frequency (TF), TF-IDF, and TextRank, and supervised machine learning methods, including Conditional Random Fields (CRF), Bidirectional Long Short-Term Memory networks (BiLSTM), and BiLSTM-CRF. Experiments are designed to compare word-level and character-level sequence labeling approaches on the supervised machine learning models and the BERT-based models.

Findings: Compared with character-level BiLSTM-CRF, the best baseline model with an F1 score of 50.16%, our character-level sequence labeling model based on BERT obtains an F1 score of 59.80%, a 9.64% absolute improvement.

Research limitations: We consider only the automatic keyphrase extraction task rather than keyphrase generation, so only keyphrases that occur in the given text can be extracted. In addition, our proposed dataset is not suitable for handling nested keyphrases.

Practical implications: We make our character-level IOB-format dataset of Chinese Automatic Keyphrase Extraction from scientific Chinese medical abstracts (CAKE) publicly available to the research community at https://github.com/possible1402/Dataset-For-Chinese-Medical-Keyphrase-Extraction.

Originality/value: Through comparative experiments, our study demonstrates that the character-level formulation is better suited to the Chinese automatic keyphrase extraction task given the general trend toward pretrained language models. Our proposed dataset also provides a unified basis for model evaluation and can promote the development of Chinese automatic keyphrase extraction.
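The character-level IOB formulation described in the abstract can be illustrated with a short sketch. The helper below is hypothetical (it is not from the paper's released code or dataset tooling): it tags each character of a Chinese text as B (beginning of a keyphrase), I (inside), or O (outside), with no word segmentation step. As the limitations section notes, nested or overlapping keyphrases cannot be represented in this scheme, so the sketch simply skips spans that are already tagged.

```python
def to_char_iob(text, keyphrases):
    """Tag each character of `text` with B/I/O labels for the given keyphrases.

    Hypothetical illustration of character-level IOB formatting; not the
    authors' actual preprocessing code.
    """
    tags = ["O"] * len(text)
    for kp in keyphrases:
        start = 0
        while True:
            i = text.find(kp, start)
            if i == -1:
                break
            # Only label spans whose characters are still untagged:
            # nested/overlapping keyphrases are not representable in IOB.
            if all(t == "O" for t in tags[i:i + len(kp)]):
                tags[i] = "B"
                for j in range(i + 1, i + len(kp)):
                    tags[j] = "I"
            start = i + len(kp)
    return list(zip(text, tags))

# A made-up medical-style example sentence with two keyphrases.
pairs = to_char_iob("针灸治疗偏头痛的疗效观察", ["针灸", "偏头痛"])
# → [('针', 'B'), ('灸', 'I'), ('治', 'O'), ('疗', 'O'),
#    ('偏', 'B'), ('头', 'I'), ('痛', 'I'), ('的', 'O'), ...]
```

Each (character, tag) pair corresponds to one line in a standard IOB-format file, which is the input shape expected by sequence labeling models such as BiLSTM-CRF or BERT token classifiers.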
