Hasan Öztürk, Alperen Değirmenci, Onur Güngör, Suzan Üsküdarli
2020 28th Signal Processing and Communications Applications Conference (SIU), published 2020-10-05. DOI: 10.1109/SIU49456.2020.9302477
The Role of Contextual Word Embeddings in Correcting the ‘de/da’ Clitic Errors in Turkish
One of the most common spelling errors in Turkish involves the clitic ‘de/da’. Writers often misspell it either by attaching it to the preceding word as a suffix when it should be written separately, or by writing it separately when it should be a suffix. Because Turkish is a morphologically rich, agglutinative language, detecting and identifying such errors is difficult, and many widely used spelling correction tools do not handle them well. In this work, we show that a sequence tagger built on BERT, which produces word embeddings that take a word's context into account, performs better than one using non-contextual word embeddings. Training and evaluation were performed on two datasets: one derived from a Turkish corpus using a special process, and one manually curated. The contextual word embeddings obtained during this task are publicly shared with the research community.
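To make the two error types concrete, the following is a minimal Python sketch of how token-level labels for a sequence tagger could be produced from correctly spelled sentences. It is not the authors' procedure: the abstract does not describe the "special process" used to derive the dataset, and the function names (`merge_clitic`, `split_suffix`) and the label set (`SPLIT`, `MERGE`, `O`) are hypothetical, chosen only for illustration. The sketch also assumes the caller already knows which token carries a genuine ‘-de/-da’ suffix.

```python
from typing import List, Tuple

# The two surface forms of the clitic discussed in the abstract.
CLITICS = {"de", "da"}


def merge_clitic(tokens: List[str]) -> Tuple[List[str], List[str]]:
    """Simulate the error of writing a standalone 'de/da' as a suffix.

    Every standalone clitic is attached to the preceding word. The resulting
    token is labelled 'SPLIT' (hypothetical label: the clitic should be
    written separately); all other tokens are labelled 'O'.
    """
    out: List[str] = []
    labels: List[str] = []
    for tok in tokens:
        if tok.lower() in CLITICS and out:
            out[-1] = out[-1] + tok.lower()
            labels[-1] = "SPLIT"
        else:
            out.append(tok)
            labels.append("O")
    return out, labels


def split_suffix(tokens: List[str], index: int) -> Tuple[List[str], List[str]]:
    """Simulate the error of writing a '-de/-da' suffix as a separate word.

    The caller supplies the index of a token known to end in a real suffix
    (assumption: this sketch does no morphological analysis). The detached
    'de/da' token is labelled 'MERGE' (hypothetical label: it belongs to the
    preceding word).
    """
    word = tokens[index]
    if len(word) <= 2 or word[-2:].lower() not in CLITICS:
        raise ValueError(f"token {word!r} does not end in 'de' or 'da'")
    out = tokens[:index] + [word[:-2], word[-2:].lower()] + tokens[index + 1:]
    labels = ["O"] * len(out)
    labels[index + 1] = "MERGE"
    return out, labels


if __name__ == "__main__":
    # "Ben de geldim" (I came too): the clitic is correctly separate.
    print(merge_clitic(["Ben", "de", "geldim"]))
    # "Evde kaldım" (I stayed at home): the suffix is correctly attached.
    print(split_suffix(["Evde", "kaldım"], 0))
```

Deciding which label applies to a real sentence depends on the surrounding words: "Bende" is a valid word on its own, so only context shows that "Bende geldim" is misspelled. This is why the abstract contrasts contextual and non-contextual embeddings.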