Gender classification of Korean personal names: Deep neural networks versus human judgments
Hyesun Cho
Lingua, Volume 303, Article 103703. Published 2024-03-12. DOI: 10.1016/j.lingua.2024.103703
https://www.sciencedirect.com/science/article/pii/S0024384124000329
Citation count: 0
Abstract
In many languages, female and male names have different phonotactic characteristics. The name–gender relationship is probabilistic; therefore, it can be captured more adequately by stochastic models than by deterministic phonological theories. In this study, the 6,000 most commonly used names in Korean (3,000 for each gender) were used to train a deep neural network (DNN), an ensemble of recurrent neural networks and convolutional neural networks. The phonotactic learner (PL) served as the baseline model. The DNN and PL models predicted the gender of 50 test names compiled from low-frequency names, and the models’ predictions were compared with human judgments on the gender of the same names. The models’ predicted labels matched the names’ actual labels more accurately for the DNN (90%) than for the PL (76%). The predictions also matched the labels assigned by human subjects more accurately for the DNN (86%) than for the PL (72%). The DNN’s predictions correlated more closely with human judgments (r² = 0.743) than the PL’s did (r² = 0.312). Given the similarity between DNN and human responses, these results suggest that neural network models should be incorporated into phonological studies.
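The two kinds of comparison reported in the abstract, label agreement and correlation with human judgments, can be illustrated with a minimal Python sketch. It assumes that accuracy is the share of test names on which two label sequences agree and that r² is the squared Pearson correlation between a model's probability of "female" and the proportion of human respondents choosing "female"; the abstract does not spell out either computation. The function names, the 0.5 decision threshold, and the four-item toy data are hypothetical and are not taken from the study.

```python
def accuracy(predicted, reference):
    """Fraction of items on which the two label sequences agree."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == r for p, r in zip(predicted, reference)) / len(predicted)


def r_squared(xs, ys):
    """Squared Pearson correlation between two numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return (sxy * sxy) / (sxx * syy)


def to_labels(p_female, threshold=0.5):
    """Turn probabilities of 'female' into hard labels (assumed threshold)."""
    return ["F" if p >= threshold else "M" for p in p_female]


# Hypothetical toy data for four test names (NOT the study's data).
actual_labels = ["F", "M", "F", "M"]
model_p_female = [0.9, 0.2, 0.7, 0.4]   # model's probability of "female"
human_p_female = [0.8, 0.1, 0.6, 0.7]   # share of respondents answering "female"

model_labels = to_labels(model_p_female)
human_labels = to_labels(human_p_female)

print("model vs. actual labels:", accuracy(model_labels, actual_labels))
print("model vs. human labels: ", accuracy(model_labels, human_labels))
print("r^2, model vs. human:   ", r_squared(model_p_female, human_p_female))
```

Keeping the two measures separate mirrors the abstract: agreement is computed on hard labels, whereas the correlation uses graded responses, so a model can agree with the majority human label on most names and still track the strength of human preferences poorly.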
About the journal:
Lingua publishes papers of any length, if justified, as well as review articles surveying developments in the various fields of linguistics, and occasional discussions. A considerable number of pages in each issue are devoted to critical book reviews. Lingua also publishes Lingua Franca articles consisting of provocative exchanges expressing strong opinions on central topics in linguistics; The Decade In articles which are educational articles offering the nonspecialist linguist an overview of a given area of study; and Taking up the Gauntlet special issues composed of a set number of papers examining one set of data and exploring whose theory offers the most insight with a minimal set of assumptions and a maximum of arguments.