The Role of Contextual Word Embeddings in Correcting the ‘de/da’ Clitic Errors in Turkish
Hasan Öztürk, Alperen Değirmenci, Onur Güngör, Suzan Üsküdarli
2020 28th Signal Processing and Communications Applications Conference (SIU), published 2020-10-05
DOI: 10.1109/SIU49456.2020.9302477
Citations: 1
Abstract
One of the most common spelling errors in Turkish involves the clitic ‘de/da’. Writers often misspell it either by attaching it to the preceding word as a suffix when it should be written separately, or by writing it separately when it should be a suffix. Because Turkish is a morphologically rich, agglutinative language, detecting and identifying such errors is difficult, and many widely used spelling correction tools do not handle them well. In this work, we show that a sequence tagger that uses BERT, which produces word embeddings that take a word's context into account, achieves higher performance than one that uses non-contextual word embeddings. Training and evaluation were performed on a dataset derived from a Turkish corpus using a special process, in addition to a manually curated dataset. The contextual word embeddings obtained during this work are publicly shared with the research community.
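
To make the sequence-tagging framing concrete, the following minimal sketch shows how per-token tags could be turned back into corrected text. The tag set (`O`, `SPLIT`, `MERGE`) and the function `apply_tags` are hypothetical and chosen only for illustration; the abstract does not specify the label scheme, the model architecture details, or the dataset construction process. The sketch covers only the correction step and assumes the tags come from some tagger.

```python
from typing import List

# Hypothetical tag set (an assumption, not taken from the paper):
#   O     - token is spelled correctly, leave it alone
#   SPLIT - token ends in a wrongly attached 'de/da'; write it separately
#   MERGE - token is a stand-alone 'de/da' that should be a suffix
KEEP, SPLIT, MERGE = "O", "SPLIT", "MERGE"
CLITICS = ("de", "da")


def apply_tags(tokens: List[str], tags: List[str]) -> List[str]:
    """Apply per-token correction tags and return the corrected tokens.

    Simplification: only the surface forms 'de' and 'da' are handled;
    consonant assimilation of the suffix (-te/-ta) is not modeled.
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    corrected: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == SPLIT and len(token) > 2 and token.lower().endswith(CLITICS):
            # Detach the last two characters as a separate clitic.
            corrected.extend([token[:-2], token[-2:]])
        elif tag == MERGE and token.lower() in CLITICS and corrected:
            # Attach the clitic to the previous token as a suffix.
            corrected[-1] = corrected[-1] + token
        else:
            corrected.append(token)
    return corrected


if __name__ == "__main__":
    # 'Bende geldim' should be 'Ben de geldim' ("I came too").
    print(" ".join(apply_tags(["Bende", "geldim"], [SPLIT, KEEP])))
    # 'Ev de kaldım' should be 'Evde kaldım' ("I stayed at home").
    print(" ".join(apply_tags(["Ev", "de", "kaldım"], [KEEP, MERGE, KEEP])))
```

In this framing the tagger decides from context which case applies, and that decision is where contextual embeddings would matter: the same surface string can be correct in one sentence and wrong in another.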