Corpus and Baseline Model for Domain-Specific Entity Recognition in German

2020 6th IEEE Congress on Information Science and Technology (CiSt). Pub Date: 2020-06-05. DOI: 10.1109/CiSt49399.2021.9357189

Sunna Torge, Waldemar Hahn, R. Jäkel, W. Nagel

Citations: 1

Abstract

Transfer learning approaches are a promising means of analyzing low-resource, domain-specific texts. The German SmartData corpus is the first German corpus annotated with entities from different domains, and thus makes it possible to investigate transfer learning approaches for Named Entity Recognition (NER) across domains. To prepare such investigations, this work presents a thorough analysis of the SmartData corpus and revises both its annotations and its split into training and test data, taking the distribution of document and entity types into account. Building on this revised corpus, a baseline NER model using BiLSTM-CRF neural networks, including hyperparameter optimization, is presented.
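The abstract mentions a train/test split that considers the distribution of document types. As a rough illustration only, and not the authors' actual procedure, the idea of a stratified split can be sketched as follows. The function name `stratified_split`, the `(doc_id, doc_type)` record layout, the 20% test fraction, and the example document types are all assumptions made for this sketch; it stratifies on document type alone and does not balance entity types.

```python
import random
from collections import defaultdict


def stratified_split(docs, test_fraction=0.2, seed=0):
    """Split documents so each document type keeps roughly the same test share.

    `docs` is a list of (doc_id, doc_type) pairs. This record layout is a
    hypothetical stand-in, not the SmartData corpus format.
    """
    # Group document ids by their type.
    by_type = defaultdict(list)
    for doc_id, doc_type in docs:
        by_type[doc_type].append(doc_id)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for doc_type in sorted(by_type):
        ids = by_type[doc_type]
        rng.shuffle(ids)
        # Hold out at least one document per type when the type has several.
        n_test = max(1, round(len(ids) * test_fraction)) if len(ids) > 1 else 0
        test.extend(ids[:n_test])
        train.extend(ids[n_test:])
    return train, test


# Toy input with made-up document types and ids.
docs = [(f"news-{i}", "news") for i in range(10)]
docs += [(f"tweet-{i}", "tweet") for i in range(5)]
train, test = stratified_split(docs)
```

With this toy input, each document type contributes about one fifth of its documents to the test set, so neither type is missing from either side of the split.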


