Linguistic Issues in the Machine Transliteration of Chinese, Japanese and Arabic Names

NEWS@ACM DOI: 10.18653/v1/W16-2707

Jack Halpern

Abstract

The romanization of non-Latin scripts is a complex computational task that is highly language dependent. This presentation will focus on three of the most challenging non-Latin scripts: Chinese, Japanese, and Arabic (CJA). Much progress has been made in personal name machine-transliteration methodologies, as documented in the various NEWS reports over the last several years. Such techniques as phrase-based SMT, RNN-based LM and CRF have emerged, leading to gradual improvements in accuracy scores. But methodology is only one aspect of the problem. Equally important is the high level of ambiguity of the CJA scripts, which poses special challenges to named entity extraction and machine transliteration. These difficulties are exacerbated by the lack of comprehensive proper noun dictionaries, the multiplicity of ambiguous transcription schemes, and orthographic variation. This presentation will clear up the differences between three basic concepts (transliteration, transcription, and romanization) that are a source of much confusion, even among computational linguists, and will focus on (1) the major linguistic issues, that is, the special characteristics of the CJA scripts that impact machine transliteration, and (2) the important role played by lexical resources such as personal name dictionaries. A major issue in romanizing Simplified Chinese (SC) is the one-to-many ambiguity of many characters (polyphones), such as 乐, which can be read /le/ or /yue/. To disambiguate accurately, the names must be looked up in word-level (not character-level) name mapping tables. This is complicated by (1) the presence of orthographic variants in Traditional Chinese (TC), and (2) the need for cross-script conversion between SC and TC. Transcription into Chinese is even more ambiguous, since some phonemes can correspond to dozens of characters. A major characteristic of Japanese, a highly agglutinative language, is the presence of countless orthographic variants. The four Japanese scripts interact in a complex way, resulting in okurigana variants (取り扱い, 取扱い, 取扱, etc. for /toriatsukai/), cross-script variants (猫, ねこ, ネコ for /neko/), kanji variants (大幅 and 大巾 for /oohaba/), kana variants (ユーザー and ユーザ for /yuuza(a)/), and more. Another issue is the numerous kun and nanori readings (some kanji have dozens) and the various romanization systems in current use, such as the Hepburn, Kunrei and hybrid systems.
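
The word-level lookup that the abstract recommends for polyphones can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the table names `NAME_TABLE` and `CHAR_DEFAULTS`, the function `romanize_name`, and the handful of entries are hypothetical and are not taken from the author's dictionaries or system. The idea is only that a whole-name match is tried first, and per-character default readings are used as a fallback.

```python
# Hypothetical word-level table: whole names mapped to their romanization.
# A real table would be a large curated dictionary; this has one entry.
NAME_TABLE = {
    "乐毅": "Yue Yi",  # here 乐 is read /yue/, not the more common /le/
}

# Hypothetical character-level default readings, used only as a fallback.
CHAR_DEFAULTS = {"乐": "le", "毅": "yi", "山": "shan"}


def romanize_name(name: str) -> str:
    """Prefer a whole-name match; otherwise join per-character defaults."""
    if name in NAME_TABLE:
        return NAME_TABLE[name]
    # Unknown characters are marked with "?" rather than guessed.
    syllables = [CHAR_DEFAULTS.get(ch, "?") for ch in name]
    return "".join(syllables).capitalize()


if __name__ == "__main__":
    print(romanize_name("乐毅"))  # resolved by the word-level table
    print(romanize_name("乐山"))  # resolved by character-level fallback
```

A purely character-level approach would render 乐毅 with the default reading /le/; the word-level entry is what selects /yue/. The fallback path shows why character-level tables alone are insufficient: they can only ever return one reading per character.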
