Developing Name Entity Recognition for Structured and Unstructured Text Formatting Dataset

2020 Fifth International Conference on Informatics and Computing (ICIC). Pub Date: 2020-11-03. DOI: 10.1109/ICIC50835.2020.9288566

N. Azzahra, Muhammad Okky Ibrohim, Junaedi Fahmi, Bagus Fajar Apriyanto, Oskar Riandi


Citations: 2

Abstract

Named-entity recognition (NER) is the task of extracting entities from text and assigning them to a set of entity classes. Most current NER research trains models on structured data such as news and Wikipedia articles. However, several tasks, such as speech-to-text, produce unstructured text. In this paper, we study NER on unstructured-format text in Indonesian using deep learning approaches including LSTM, bidirectional LSTM (Bi-LSTM), GRU, bidirectional GRU (Bi-GRU), and convolutional neural networks (CNN). We used the NERGRIT CORPUS as our dataset and modified it into four structured and unstructured variants. We then ran experiments combining every dataset variant with every deep learning algorithm. Bi-GRU obtained the highest F-score on all four variants: 71.04% on the standard dataset, 70.61% on the lowercase dataset with punctuation, 68.12% on the lowercase dataset without punctuation, and 67.45% on the lowercase and clean dataset.
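The dataset variants named in the abstract can be illustrated with a minimal sketch. It is not the authors' preprocessing code: the (token, tag) pair format, the BIO tag names, the example sentence, and all function names are assumptions made for illustration. The abstract does not define what "clean" means in the fourth variant, so the sketch covers only the first three.

```python
import string
from typing import Dict, List, Tuple

# Assumed representation: a sentence is a list of (token, tag) pairs.
# The BIO tag names used below are hypothetical, not taken from the paper.
Sentence = List[Tuple[str, str]]


def lowercase(sentence: Sentence) -> Sentence:
    """Lowercase every token while leaving its tag unchanged."""
    return [(token.lower(), tag) for token, tag in sentence]


def is_punctuation(token: str) -> bool:
    """True if the token is non-empty and made only of ASCII punctuation."""
    return bool(token) and all(ch in string.punctuation for ch in token)


def drop_punctuation(sentence: Sentence) -> Sentence:
    """Remove punctuation tokens together with their tags."""
    return [(token, tag) for token, tag in sentence if not is_punctuation(token)]


def make_variants(sentence: Sentence) -> Dict[str, Sentence]:
    """Build three of the four variants mentioned in the abstract.

    The "lowercase and clean" variant is omitted because the abstract
    does not say what cleaning step it applies.
    """
    lower = lowercase(sentence)
    return {
        "standard": list(sentence),
        "lowercase_with_punctuation": lower,
        "lowercase_without_punctuation": drop_punctuation(lower),
    }


# Hypothetical Indonesian example sentence with assumed tags.
example: Sentence = [
    ("Presiden", "O"),
    ("Joko", "B-PER"),
    ("Widodo", "I-PER"),
    ("tiba", "O"),
    ("di", "O"),
    ("Jakarta", "B-LOC"),
    (".", "O"),
]

variants = make_variants(example)
for name, tagged in variants.items():
    print(name, [token for token, _ in tagged])
```

Dropping a punctuation token together with its tag keeps tokens and labels aligned, which a sequence tagger such as a Bi-GRU needs. The lowercase variant without punctuation roughly mimics speech-to-text output, which is the setting the abstract gives as motivation.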
