
Latest publications: 2019 International Conference on Asian Language Processing (IALP)

A Comparative Analysis of Acoustic Characteristics between Kazak & Uyghur Mandarin Learners and Standard Mandarin Speakers
Pub Date : 2019-11-01 DOI: 10.1109/IALP48816.2019.9037703
Gulnur Arkin, Gvljan Alijan, A. Hamdulla, Mijit Ablimit
In this paper, based on vowel pronunciation corpora from 20 Kazakh undergraduate Mandarin learners, 10 Uyghur learners, and 10 standard speakers, the methods of experimental phonetics are applied to Kazak and Uyghur learners within the framework of the phonetic learning model and comparative analysis. The Mandarin vowels of the learners and the standard speakers were analyzed for acoustic characteristics such as formant frequency values, and prosodic parameters such as vowel duration similarity were compared against the standard speakers. These results help provide learners with effective teaching-related reference information, supply reliable and correct parameters and pronunciation assessments for computer-assisted language learning (CALL) systems, and improve the accuracy of multi-ethnic Chinese Putonghua speech recognition and ethnic identification.
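The kind of acoustic comparison described above can be illustrated with a minimal sketch. The formant values and durations below are invented placeholders (a real study would measure F1/F2 with a phonetics tool such as Praat); only the distance and similarity computations are shown:

```python
import math

# Hypothetical mean (F1, F2) formant values in Hz for two vowels,
# for a standard speaker and a learner (assumed numbers, not real data).
standard = {"a": (850, 1610), "i": (290, 2250)}
learner = {"a": (780, 1450), "i": (330, 2100)}

def formant_distance(v1, v2):
    """Euclidean distance between two (F1, F2) formant vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

def duration_similarity(d1, d2):
    """Ratio-based similarity of two vowel durations (1.0 = identical)."""
    return min(d1, d2) / max(d1, d2)

for vowel in standard:
    dist = formant_distance(learner[vowel], standard[vowel])
    print(f"/{vowel}/ formant distance: {dist:.1f} Hz")

# Compare a learner's vowel duration (0.21 s) to the standard (0.18 s).
print(f"duration similarity: {duration_similarity(0.21, 0.18):.2f}")
```

A lower formant distance and a duration similarity closer to 1.0 would indicate a learner vowel closer to the standard pronunciation.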
Citations: 0
On the Etymology of he ‘river’ in Chinese
Pub Date : 2019-11-01 DOI: 10.1109/IALP48816.2019.9037654
Huibin Zhuang, Zhanting Bu
In Chinese, he 河 ‘river’ can be used both as a proper name (for the Yellow River) and as a common word for rivers in North China. Based on linguistic data, ethnological evidence, and historical documents, this paper argues against the leading hypotheses and proposes that he originated from the Old Yi language, entered Chinese through language contact, and replaced shui, which came from Old Qiang, later becoming the only common noun for river in North China.
Citations: 0
Diachronic Synonymy and Polysemy: Exploring Dynamic Relation Between Forms and Meanings of Words Based on Word Embeddings
Pub Date : 2019-11-01 DOI: 10.1109/IALP48816.2019.9037663
Shichen Liang, Jianyu Zheng, Xuemei Tang, Renfen Hu, Zhiying Liu
In recent years, there have been many publications that use distributed methods to track temporal changes in lexical semantics. However, most current research only states the simple fact that the meanings of words have changed, lacking more detailed and in-depth analysis. We combine linguistic theory with a word embedding model to study Chinese diachronic semantics. Specifically, the two methods of word analogy and word similarity are associated with diachronic synonymy and diachronic polysemy respectively, and aligned diachronic word embeddings are used to detect changes in the relationship between the forms and meanings of words. Through experiments and case studies, our method achieves the intended result. We also find that the evolution of Chinese vocabulary is closely related to social development, and that there is a certain correlation between the polysemy and synonymy of word meanings.
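The word-similarity and word-analogy operations on aligned embeddings can be sketched as below. The three-dimensional vectors are toy values standing in for real aligned diachronic embeddings; the paper's actual alignment procedure is not reproduced here:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def analogy(a, b, c):
    """Word-analogy offset vector: a - b + c (e.g. king - man + woman)."""
    return [x - y + z for x, y, z in zip(a, b, c)]

# Toy aligned embeddings of one word in two time periods (assumed values).
vec_1950 = [0.9, 0.1, 0.2]
vec_2000 = [0.3, 0.8, 0.4]

# Low cross-period self-similarity suggests the word's meaning drifted.
drift = 1 - cosine(vec_1950, vec_2000)
print(f"semantic drift score: {drift:.3f}")
```

Comparing a word's embedding across aligned periods (similarity) probes polysemy-style change, while analogy offsets probe whether form-meaning relations such as synonym pairs stay parallel over time.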
Citations: 0
Developing a machine learning-based grade level classifier for Filipino children’s literature
Pub Date : 2019-11-01 DOI: 10.1109/IALP48816.2019.9037694
Joseph Marvin Imperial, R. Roxas, Erica Mae Campos, Jemelee Oandasan, Reyniel Caraballo, Ferry Winsley Sabdani, Ani Rosa Almaroi
Reading is an essential part of children’s learning. Identifying the proper readability level of reading materials will ensure effective comprehension. We present our efforts to develop a baseline model for automatically identifying the readability of children’s and young adults’ books written in Filipino using machine learning algorithms. For this study, we processed 258 picture books published by Adarna House Inc. In contrast to old readability formulas relying on static attributes like the number of words, sentences, syllables, etc., other textual features were explored. Count vectors, Term Frequency-Inverse Document Frequency (TF-IDF), n-grams, and character-level n-grams were extracted to train models using three major machine learning algorithms: Multinomial Naïve Bayes, Random Forest, and K-Nearest Neighbors. Combining K-Nearest Neighbors and Random Forest via a voting-based classification mechanism yielded the best-performing model, with a high average training accuracy of 0.822 and a validation accuracy of 0.74. Analysis of the ten most useful features for each algorithm shows that they share a common signal for identifying readability levels: the use of Filipino stop words. The performance of other classifiers and features was also explored.
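The voting-based combination of classifiers can be sketched with a minimal hard-voting ensemble. The per-book predictions below are hypothetical, and this stdlib-only sketch stands in for the authors' actual trained KNN and Random Forest models:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine label predictions from several classifiers by hard voting."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical grade-level predictions for three books from three models.
knn_pred = ["grade1", "grade2", "grade2"]   # K-Nearest Neighbors
rf_pred = ["grade1", "grade1", "grade2"]    # Random Forest
nb_pred = ["grade1", "grade2", "grade3"]    # Multinomial Naive Bayes

# One ensemble decision per book: the label most models agree on.
combined = [majority_vote(p) for p in zip(knn_pred, rf_pred, nb_pred)]
print(combined)
```

In scikit-learn the same idea is available as `VotingClassifier`; the sketch above just makes the voting mechanism explicit.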
Citations: 8
Employing Gated Attention and Multi-similarities to Resolve Document-level Chinese Event Coreference
Pub Date : 2019-11-01 DOI: 10.1109/IALP48816.2019.9037674
Haoyi Cheng, Peifeng Li, Qiaoming Zhu
Event coreference resolution is a challenging task. To address the influence of event-independent information in event mentions and the flexible, diverse sentence structures of the Chinese language, this paper introduces a GANN (Gated Attention Neural Networks) model for document-level Chinese event coreference resolution. GANN introduces a gated attention mechanism to select event-related information from event mentions and then filter out noisy information. Moreover, GANN not only uses a single Cosine distance to calculate the linear distance between two event mentions, but also introduces multiple mechanisms, i.e., Bilinear distance and a Single-Layer Network, to further calculate linear and nonlinear distances. Experimental results on the ACE 2005 Chinese corpus show that our GANN model outperforms the state-of-the-art baselines.
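The two linear similarity mechanisms named above, Cosine distance and Bilinear distance, can be sketched on toy event-mention vectors. The vectors and the weight matrix are invented for illustration; in the model, W would be learned:

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two event-mention vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def bilinear_sim(u, v, W):
    """Bilinear similarity u^T W v with a (learned) weight matrix W."""
    Wv = [sum(W[i][j] * v[j] for j in range(len(v))) for i in range(len(W))]
    return sum(u[i] * Wv[i] for i in range(len(u)))

# Toy 2-d vectors for two event mentions, and an identity weight matrix.
e1, e2 = [1.0, 0.0], [0.6, 0.8]
W = [[1.0, 0.0], [0.0, 1.0]]
print(cosine_sim(e1, e2), bilinear_sim(e1, e2, W))
```

With an identity W the bilinear score reduces to a dot product; a learned W lets the model weight and mix dimensions when comparing mentions, which is what makes it a distinct mechanism from plain cosine.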
Citations: 0
An End-to-End Model Based on TDNN-BiGRU for Keyword Spotting
Pub Date : 2019-11-01 DOI: 10.1109/IALP48816.2019.9037714
Shuzhou Chai, Zhenye Yang, Changsheng Lv, Weiqiang Zhang
In this paper, we propose a neural network architecture based on a Time-Delay Neural Network (TDNN) and a Bidirectional Gated Recurrent Unit (BiGRU) for small-footprint keyword spotting. Our model consists of three parts: TDNN, BiGRU, and an attention mechanism. The TDNN models temporal information, and the BiGRU extracts hidden-layer features of the audio. The attention mechanism generates a fixed-length vector from the hidden-layer features. The system produces the final score through a linear transformation and a softmax function. We explored the step size and unit size of the TDNN and two attention mechanisms. Our model achieves a true positive rate of 99.63% at a 5% false positive rate.
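The attention step, collapsing a variable-length sequence of hidden states into one fixed-length vector, can be sketched as follows. The hidden states and scoring vector are toy values; the real model would learn the scoring parameters:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden_states, score_vector):
    """Weight each hidden state by a softmax attention score and sum,
    yielding one fixed-length vector regardless of sequence length."""
    scores = [sum(h_i * w_i for h_i, w_i in zip(h, score_vector))
              for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    return [sum(w * h[d] for w, h in zip(weights, hidden_states))
            for d in range(dim)]

# Toy BiGRU outputs for 3 time steps, feature dimension 2.
H = [[0.1, 0.9], [0.8, 0.2], [0.4, 0.4]]
context = attention_pool(H, [1.0, 0.0])
print(context)
```

The pooled `context` vector would then be scored by a linear layer and softmax to decide whether the keyword occurred.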
Citations: 2
Improving Japanese-English Bilingual Mapping of Word Embeddings based on Language Specificity
Pub Date : 2019-11-01 DOI: 10.1109/IALP48816.2019.9037649
Yuting Song, Biligsaikhan Batjargal, Akira Maeda
Recently, cross-lingual word embeddings have attracted much attention because they capture the semantic meanings of words across languages and can be applied to cross-lingual tasks. Most methods learn a single mapping (e.g., a linear mapping) to transform the word embedding space of one language into that of another. In this paper, we propose an advanced method for improving bilingual word embeddings by adding a language-specific mapping. We focus on learning a Japanese-English bilingual word embedding mapping that takes the specificity of the Japanese language into account. On a benchmark data set for Japanese-English bilingual lexicon induction, the proposed method achieved competitive performance compared to the method using a single mapping, with better results on original Japanese words.
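The baseline single-mapping setup can be sketched as below: apply a linear map W to a source-language embedding, then retrieve the nearest target-language word by cosine similarity. The 2-d embeddings and the identity W are toy assumptions; the paper's added language-specific mapping is not reproduced here:

```python
import math

def apply_mapping(W, x):
    """Map a source-language embedding x into the target space: y = W x."""
    return [sum(W[i][j] * x[j] for j in range(len(x))) for i in range(len(W))]

def nearest(y, target_vocab):
    """Retrieve the target word whose embedding is closest to y (cosine)."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)
    return max(target_vocab, key=lambda w: cos(y, target_vocab[w]))

# Toy 2-d embeddings; W would normally be learned from a seed lexicon.
W = [[1.0, 0.0], [0.0, 1.0]]
ja = {"犬": [0.9, 0.1]}
en = {"dog": [0.85, 0.15], "cat": [0.1, 0.9]}
print(nearest(apply_mapping(W, ja["犬"]), en))
```

Bilingual lexicon induction is evaluated exactly this way: map each source word and check whether the retrieved nearest neighbor is the gold translation.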
Citations: 1
Extremely Low Resource Text simplification with Pre-trained Transformer Language Model
Pub Date : 2019-11-01 DOI: 10.1109/IALP48816.2019.9037650
T. Maruyama, Kazuhide Yamamoto
Recent text simplification approaches regard the task as monolingual text-to-text generation inspired by machine translation. In particular, transformer-based translation models outperform previous methods. Although machine translation approaches need a large-scale parallel corpus, the parallel corpora available for text simplification are very small compared to those for machine translation tasks. Therefore, we attempt a simple approach that fine-tunes a pre-trained language model for text simplification with a small parallel corpus. Specifically, we conduct experiments with the following two models: a transformer-based encoder-decoder model and a language model that receives a joint input of the original and simplified sentences, called TransformerLM. We show that TransformerLM, a simple text generation model, substantially outperforms a strong baseline. In addition, we show that a fine-tuned TransformerLM with only 3,000 supervised examples can achieve performance comparable to a strong baseline trained on all the supervised data.
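The "joint input" formulation for a decoder-only language model can be sketched as simple string formatting. The `<sep>` and `<eos>` token names are assumptions for illustration; the paper's exact special tokens may differ:

```python
def make_joint_input(original, simplified, sep="<sep>", eos="<eos>"):
    """Format one training example: the LM sees the original and the
    simplified sentence as a single joined sequence."""
    return f"{original} {sep} {simplified} {eos}"

def make_prompt(original, sep="<sep>"):
    """At inference time, the model generates the simplification as a
    continuation of the text after the separator."""
    return f"{original} {sep}"

pair = make_joint_input("The committee adjourned sine die.",
                        "The committee ended without a new date.")
print(pair)
print(make_prompt("The committee adjourned sine die."))
```

Fine-tuning then reduces to ordinary next-token language modeling over such joined sequences, which is why only a small parallel corpus is needed on top of the pre-trained model.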
Citations: 10
Neural Machine Translation Strategies for Generating Honorific-style Korean
Pub Date : 2019-11-01 DOI: 10.1109/IALP48816.2019.9037681
Lijie Wang, Mei Tu, Mengxia Zhai, Huadong Wang, Song Liu, Sang Ha Kim
Expression with honorifics is an important way of dressing up language and showing politeness in Korean. For machine translation, generating honorifics is indispensable in formal settings when the target language is Korean. However, current Neural Machine Translation (NMT) models ignore the generation of honorifics, which limits MT applications in business settings. To address this problem, this paper presents two strategies to improve the Korean honorific generation ratio: 1) we introduce an honorific fusion training (HFT) loss under the minimum risk training framework to guide the model to generate honorifics; 2) we introduce a data labeling (DL) method that tags the training corpus with distinctive labels without any modification to the model structure. Our experimental results show that the proposed two strategies significantly improve the honorific generation ratio, by 34.35% and 45.59% respectively.
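The data labeling (DL) strategy can be sketched as tagging each training pair with a style label on the source side. The `<hon>`/`<plain>` tag names are illustrative assumptions, not the paper's exact labels:

```python
def label_corpus(pairs):
    """Prepend a style tag to each source sentence so the NMT model learns
    to associate the tag with honorific (or plain) target style."""
    labeled = []
    for src, tgt, honorific in pairs:
        tag = "<hon>" if honorific else "<plain>"
        labeled.append((f"{tag} {src}", tgt))
    return labeled

# Tiny illustrative corpus: (English source, Korean target, is_honorific).
corpus = [
    ("Please sit down.", "앉으십시오.", True),
    ("Sit down.", "앉아.", False),
]
for src, tgt in label_corpus(corpus):
    print(src, "->", tgt)
```

At inference time, prepending the honorific tag to the input steers the unchanged NMT model toward honorific output, which is why this method needs no modification to the model structure.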
Citations: 1
A Study on Syntactic Complexity and Text Readability of ASEAN English News
Pub Date : 2019-11-01 DOI: 10.1109/IALP48816.2019.9037695
Yusha Zhang, Nankai Lin, Sheng-yi Jiang
English is the most widely used language in the world. With the spread and evolution of the language, English texts differ in expression and reading difficulty across regions. Because of differences in content and wording, English news in some countries is easier to understand than in others. An accurate and effective method for calculating text difficulty not only helps news writers produce easy-to-understand articles but also helps readers choose articles they can understand. In this paper, we study differences in text readability among most ASEAN countries, England, and America. We compare the textual readability and syntactic complexity of English news texts from England, America, and eight ASEAN countries (Indonesia, Malaysia, the Philippines, Singapore, Brunei, Thailand, Vietnam, and Cambodia), selecting the authoritative news media of each country as the research object. We used several indicators, including the Flesch-Kincaid Grade Level (FKG), Flesch Reading Ease Index (FRE), Gunning Fog Index (GF), Automated Readability Index (AR), Coleman-Liau Index (CL), and Linsear Write Index (LW), to measure textual readability, and then applied L2SCA to analyze the syntactic complexity of the news texts. Based on the analysis results, we used hierarchical clustering to classify the English texts of the different countries into six levels. Moreover, we elucidate the reasons for these readability differences.
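One of the indicators named above, the Flesch-Kincaid Grade Level, can be sketched directly from its published formula, FKG = 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59. The syllable counter below is a naive vowel-group approximation (real tools use pronunciation dictionaries), so scores are illustrative only:

```python
def count_syllables(word):
    """Approximate syllables as runs of vowel letters (naive heuristic)."""
    word = word.lower().strip(".,;:!?")
    groups, prev_vowel = 0, False
    for ch in word:
        is_vowel = ch in "aeiouy"
        if is_vowel and not prev_vowel:
            groups += 1
        prev_vowel = is_vowel
    return max(groups, 1)

def flesch_kincaid_grade(text):
    """FKG = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59"""
    sentences = max(text.count(".") + text.count("!") + text.count("?"), 1)
    words = text.split()
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / sentences
            + 11.8 * syllables / len(words)
            - 15.59)

print(round(flesch_kincaid_grade("The cat sat on the mat."), 2))
```

The other indices (FRE, GF, AR, CL, LW) follow the same pattern, combining ratios of words, sentences, syllables, or characters with fixed published coefficients.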
Citations: 2