Classification of personal names with application to DBLP
Authors: M. Biryukov, Yafang Wang
Published in: 2008 Third International Conference on Digital Information Management
Publication date: 2008-11-01
DOI: 10.1109/ICDIM.2008.4746754 (https://doi.org/10.1109/ICDIM.2008.4746754)
Citations: 3
Abstract
In this paper we propose a new perspective on data analysis in digital libraries, bibliographic databases, and other databases containing personal names. Knowing a person's language or cultural background can be beneficial in many applications; however, this information is often not explicitly present in such databases. We present a statistical tool for automatic language detection of personal names. Our system does not require a dictionary of names for training and currently handles 14 languages. General-purpose corpora for all Western European languages, Chinese, Japanese, and Turkish are used to build simple statistical models of the languages. The tool is fine-tuned to achieve precision and recall above 90% for many languages, which is better than some other systems for language identification of personal names. Using the bibliographic database DBLP as an example, we show how our tool can be used in tasks such as data cleaning and trend discovery.
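The abstract does not say which statistical model the tool uses. As a rough illustration of the general idea (language models trained on ordinary text rather than on a name dictionary, then applied to names), the sketch below assumes character trigrams with add-one smoothing. The two toy corpora, the function names, and the choice of trigrams are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def char_ngrams(word, n=3):
    """Return the character n-grams of a word, padded with start/end markers."""
    padded = "^" * (n - 1) + word.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train_models(corpora, n=3):
    """Count character n-grams per language from general-purpose text."""
    models = {}
    for lang, text in corpora.items():
        counts = Counter()
        for word in text.split():
            counts.update(char_ngrams(word, n))
        models[lang] = counts
    return models


def log_likelihood(name, counts, vocab_size, n=3):
    """Add-one smoothed log-likelihood of a name under one language model."""
    total = sum(counts.values())
    score = 0.0
    for part in name.split():
        for gram in char_ngrams(part, n):
            score += math.log((counts[gram] + 1) / (total + vocab_size))
    return score


def classify(name, models, n=3):
    """Pick the language whose model gives the name the highest likelihood."""
    # Shared vocabulary size (plus one slot for unseen n-grams) keeps the
    # smoothing comparable across languages.
    vocab_size = len(set().union(*models.values())) + 1
    return max(
        models,
        key=lambda lang: log_likelihood(name, models[lang], vocab_size, n),
    )


# Hypothetical toy corpora; a real system would use large general-purpose text.
TOY_CORPORA = {
    "german": "der schwarze hund und die kleine katze schlafen unter dem tisch",
    "italian": "il gatto nero e il cane piccolo dormono sotto la tavola",
}
MODELS = train_models(TOY_CORPORA)

print(classify("Schwarz", MODELS))
print(classify("Gatti", MODELS))
```

With corpora this small the output is only indicative. The point of the sketch is that no list of names is needed for training, which matches the property claimed in the abstract.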