El-Cezeri Fen ve Mühendislik Dergisi, Pub Date: 2022-09-07, DOI: 10.31202/ecjse.1131826

Aiman Abibullayeva, Aydın Çetin

Citations: 0

BERT ile Kazak Haber Veri Kümesinden Anahtar Kelime Çıkarımı (Keyword Extraction from the Kazakh News Dataset with BERT)

Keywords provide a concise and precise description of a document's content. Because keywords are important and manual annotation is laborious, automatic keyword extraction makes the process easier and faster. This paper presents keyword extraction from a Kazakh news dataset. Model performance results were obtained with the BERT-base-uncased and BERT-base-multilingual-uncased pre-trained language models on the newly compiled Kazakh News Dataset (KND). The dataset consists of 7,060 items collected from the web pages anatili.kazgazeta.kz, Bilimdinews.kz, and zhasalash.kz using the BeautifulSoup and Requests libraries. These web pages mostly contain news, history, and literary texts. The dataset includes the publication name or news title, the author of the publication or news subject, and the URL of the Kazakh news site. In the evaluation of the training results, the BERT-base-multilingual-uncased model achieved a higher F-score than the BERT-base-uncased model.
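The abstract reports F-scores but does not say how they were computed. As a rough illustration only, the sketch below assumes an exact-match comparison between a predicted keyword set and a reference keyword set; the function name `keyword_f_score` and the sample keywords are hypothetical and are not taken from the paper or from KND.

```python
def keyword_f_score(predicted, gold):
    """Exact-match precision, recall and F1 between two keyword collections.

    Assumption: keywords are compared as case-insensitive, whitespace-trimmed
    strings. The paper may use a different (e.g. token-level) scheme.
    """
    pred = {k.strip().lower() for k in predicted if k.strip()}
    ref = {k.strip().lower() for k in gold if k.strip()}
    if not pred or not ref:
        return 0.0, 0.0, 0.0

    hits = len(pred & ref)            # keywords found in both sets
    precision = hits / len(pred)      # share of predictions that are correct
    recall = hits / len(ref)          # share of reference keywords recovered
    if hits == 0:
        return precision, recall, 0.0

    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical keywords, invented for illustration (not from KND).
predicted = ["білім", "мектеп", "тарих"]
gold = ["білім", "тарих", "әдебиет", "тіл"]
precision, recall, f1 = keyword_f_score(predicted, gold)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")  # P=0.67 R=0.50 F1=0.57
```

Under this assumption, a model is rewarded only for keywords it reproduces exactly, so the F-score balances how many predicted keywords are correct against how many reference keywords are recovered.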
