Developing Name Entity Recognition for Structured and Unstructured Text Formatting Dataset

N. Azzahra, Muhammad Okky Ibrohim, Junaedi Fahmi, Bagus Fajar Apriyanto, Oskar Riandi
Published in: 2020 Fifth International Conference on Informatics and Computing (ICIC)
Publication date: 2020-11-03
DOI: 10.1109/ICIC50835.2020.9288566 (https://doi.org/10.1109/ICIC50835.2020.9288566)
Citations: 2

Abstract

Named-entity recognition (NER) is the task of extracting entity mentions from text and assigning each one to one of several entity classes. Most current NER research trains models on structured text such as news and Wikipedia articles. However, some tasks, such as speech-to-text, produce unstructured text. In this paper, we study NER on unstructured text formatting datasets in Indonesian using deep learning approaches: LSTM, bidirectional LSTM (Bi-LSTM), GRU, bidirectional GRU (Bi-GRU), and convolutional neural network (CNN). We use the NERGRIT CORPUS as our dataset and modify it into four structured and unstructured variants. We then run experiments that combine each data variant with each deep learning algorithm. Bi-GRU obtains the highest F-score on all four variants: 71.04% on the standard dataset, 70.61% on the lowercase-with-punctuation dataset, 68.12% on the lowercase-without-punctuation dataset, and 67.45% on the lowercase-and-clean dataset.
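
The abstract names four dataset variants but does not say how they were produced. The sketch below shows one plausible way to derive them from a tokenized, tagged sentence while keeping tokens and tags aligned. The function names, the variant keys, the example sentence, and the generic BIO tags are all hypothetical, and the reading of "clean" as "strip remaining non-alphanumeric characters" is an assumption, not the authors' definition.

```python
import string

# Illustrative only: the function and variant names below are hypothetical,
# not taken from the paper.
PUNCT = set(string.punctuation)


def is_punct(token):
    """True when the token is non-empty and made only of ASCII punctuation."""
    return bool(token) and all(ch in PUNCT for ch in token)


def unzip(pairs):
    """Split [(token, tag), ...] back into two parallel lists."""
    return [tok for tok, _ in pairs], [tag for _, tag in pairs]


def make_variants(tokens, tags):
    """Build four versions of one tagged sentence, keeping tokens and tags aligned."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    standard = list(zip(tokens, tags))
    lower = [(tok.lower(), tag) for tok, tag in standard]
    # Dropping a token also drops its tag. If a punctuation token carried a
    # B- tag, the following I- tag would need repair; that is not handled here.
    no_punct = [(tok, tag) for tok, tag in lower if not is_punct(tok)]
    # ASSUMPTION: "clean" means also stripping non-alphanumeric characters
    # inside the remaining tokens. The abstract does not define it.
    clean = []
    for tok, tag in no_punct:
        kept = "".join(ch for ch in tok if ch.isalnum())
        if kept:
            clean.append((kept, tag))
    return {
        "standard": unzip(standard),
        "lowercase_with_punctuation": unzip(lower),
        "lowercase_without_punctuation": unzip(no_punct),
        "lowercase_and_clean": unzip(clean),
    }


# Made-up Indonesian sentence with generic BIO tags (not NERGRIT's tag set).
tokens = ["Joko", "Widodo", "tiba", "di", "Jakarta", ",", "pukul", "10.30", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "O", "O", "O", "O"]

for name, (toks, labels) in make_variants(tokens, tags).items():
    print(name, toks, labels)
```

The lowercase variants without punctuation are meant to resemble the kind of text a speech-to-text system might emit, which is the motivation the abstract gives; whether the authors built them this way is not stated.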