N. Azzahra, Muhammad Okky Ibrohim, Junaedi Fahmi, Bagus Fajar Apriyanto, Oskar Riandi
Developing Name Entity Recognition for Structured and Unstructured Text Formatting Dataset
2020 Fifth International Conference on Informatics and Computing (ICIC), 2020-11-03
DOI: 10.1109/ICIC50835.2020.9288566 (https://doi.org/10.1109/ICIC50835.2020.9288566)
Citation count: 2
Abstract
Named-Entity Recognition (NER) is the task of extracting entity information from a dataset and assigning it to several different entity classes. Most current NER research trains models on structured data such as news and Wikipedia articles. However, several tasks, such as speech-to-text, produce unstructured text. In this paper, we study NER on unstructured text formatting datasets in the Indonesian language using deep learning approaches including LSTM, Bidirectional LSTM (Bi-LSTM), GRU, Bidirectional GRU (Bi-GRU), and Convolutional Neural Network (CNN). We used the NERGRIT CORPUS as our dataset and modified it into four types of structured and unstructured datasets. We then ran experiments combining every type of data modification with every deep learning algorithm. Bi-GRU obtained the highest F-score on all four datasets: 71.04% on the standard dataset, 70.61% on the lowercase with punctuation dataset, 68.12% on the lowercase without punctuation dataset, and 67.45% on the lowercase and clean dataset.
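The four dataset types named in the abstract could be derived from one tagged sentence roughly as follows. This is a minimal sketch, not the authors' preprocessing: the function name `make_variants`, the variant keys, the example sentence, and the tag labels are all hypothetical, and the abstract does not define "clean", so the sketch assumes it means stripping any remaining non-alphanumeric characters inside tokens. The key point is that a token and its tag are dropped together so the sequences stay aligned.

```python
import string


def make_variants(tokens, tags):
    """Build four (tokens, tags) variants of one tagged sentence.

    Hypothetical helper: the variant names mirror the abstract, but the
    exact preprocessing rules are assumptions.
    """
    # 1. Standard: original casing and punctuation.
    standard = (list(tokens), list(tags))

    # 2. Lowercase with punctuation: only the casing changes.
    lower_punct = ([t.lower() for t in tokens], list(tags))

    # 3. Lowercase without punctuation: drop tokens made up only of
    #    punctuation, together with their tags.
    kept = [
        (t.lower(), g)
        for t, g in zip(tokens, tags)
        if not all(ch in string.punctuation for ch in t)
    ]
    lower_nopunct = ([t for t, _ in kept], [g for _, g in kept])

    # 4. Lowercase and clean (ASSUMPTION): also remove non-alphanumeric
    #    characters inside the remaining tokens; drop tokens left empty.
    cleaned = []
    for t, g in kept:
        stripped = "".join(ch for ch in t if ch.isalnum())
        if stripped:
            cleaned.append((stripped, g))
    lower_clean = ([t for t, _ in cleaned], [g for _, g in cleaned])

    return {
        "standard": standard,
        "lower_punct": lower_punct,
        "lower_nopunct": lower_nopunct,
        "lower_clean": lower_clean,
    }


# Illustrative sentence with made-up BIO tags (not taken from NERGRIT).
example_tokens = ["Joko", "Widodo", "tiba", "di", "Jakarta", ",", "Rp10.000", "."]
example_tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "O", "O", "O"]
example_variants = make_variants(example_tokens, example_tags)
```

Each variant can then be fed to the same model so that differences in F-score reflect the text formatting and not the architecture.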