Automatic Keyphrase Extraction from Scientific Chinese Medical Abstracts Based on Character-Level Sequence Labeling

Liangping Ding, Zhixiong Zhang, Huan Liu, Jie Li, Gaihong Yu
Journal of Data and Information Science, Vol. 6, No. 1, pp. 35–57. Published 2021-03-02. DOI: 10.2478/jdis-2021-0013. Cited by: 7.

Abstract

Purpose: Automatic keyphrase extraction (AKE) is an important task for grasping the main points of a text. In this paper, we combine the benefits of the sequence labeling formulation and pretrained language models to propose an automatic keyphrase extraction model for Chinese scientific abstracts.

Design/methodology/approach: We treat AKE from Chinese text as a character-level sequence labeling task to avoid the segmentation errors of Chinese tokenizers, and initialize our model with the pretrained language model BERT, released by Google in 2018. We collect data from the Chinese Science Citation Database and construct a large-scale dataset from the medical domain, containing 100,000 abstracts as the training set, 6,000 abstracts as the development set, and 3,094 abstracts as the test set. As baselines, we use unsupervised keyphrase extraction methods, including term frequency (TF), TF-IDF, and TextRank, and supervised machine learning methods, including Conditional Random Fields (CRF), Bidirectional Long Short-Term Memory networks (BiLSTM), and BiLSTM-CRF. Experiments are designed to compare word-level and character-level sequence labeling approaches on the supervised machine learning models and the BERT-based models.

Findings: Compared with character-level BiLSTM-CRF, the best baseline model with an F1 score of 50.16%, our character-level sequence labeling model based on BERT obtains an F1 score of 59.80%, a 9.64% absolute improvement.

Research limitations: We consider only the automatic keyphrase extraction task rather than keyphrase generation, so only keyphrases that occur in the given text can be extracted. In addition, our proposed dataset is not suitable for handling nested keyphrases.

Practical implications: We make our character-level IOB-format dataset of Chinese Automatic Keyphrase Extraction from scientific Chinese medical abstracts (CAKE) publicly available to the research community at https://github.com/possible1402/Dataset-For-Chinese-Medical-Keyphrase-Extraction.

Originality/value: Through comparative experiments, our study demonstrates that the character-level formulation is better suited to the Chinese automatic keyphrase extraction task given the general trend toward pretrained language models. Our proposed dataset also provides a unified basis for model evaluation and can promote the development of Chinese automatic keyphrase extraction.
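The character-level IOB formulation described in the abstract can be illustrated with a short sketch. The helper below is hypothetical (it is not from the paper's released code or dataset tooling): it tags each character of a Chinese text as B (beginning of a keyphrase), I (inside), or O (outside), with no word segmentation step. As the limitations section notes, nested or overlapping keyphrases cannot be represented in this scheme, so the sketch simply skips spans that are already tagged.

```python
def to_char_iob(text, keyphrases):
    """Tag each character of `text` with B/I/O labels for the given keyphrases.

    Hypothetical illustration of character-level IOB formatting; not the
    authors' actual preprocessing code.
    """
    tags = ["O"] * len(text)
    for kp in keyphrases:
        start = 0
        while True:
            i = text.find(kp, start)
            if i == -1:
                break
            # Only label spans whose characters are still untagged:
            # nested/overlapping keyphrases are not representable in IOB.
            if all(t == "O" for t in tags[i:i + len(kp)]):
                tags[i] = "B"
                for j in range(i + 1, i + len(kp)):
                    tags[j] = "I"
            start = i + len(kp)
    return list(zip(text, tags))

# A made-up medical-style example sentence with two keyphrases.
pairs = to_char_iob("针灸治疗偏头痛的疗效观察", ["针灸", "偏头痛"])
# → [('针', 'B'), ('灸', 'I'), ('治', 'O'), ('疗', 'O'),
#    ('偏', 'B'), ('头', 'I'), ('痛', 'I'), ('的', 'O'), ...]
```

Each (character, tag) pair corresponds to one line in a standard IOB-format file, which is the input shape expected by sequence labeling models such as BiLSTM-CRF or BERT token classifiers.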
