
Latest publications in Acta Linguistica Academica

Guest Editor's Foreword
IF 0.5 · Tier 3 (Literature) · LANGUAGE & LINGUISTICS · Pub Date: 2022-12-12 · DOI: 10.1556/2062.2022.00623
Gábor Prószéky
{"title":"Guest Editor's Foreword","authors":"Gábor Prószéky","doi":"10.1556/2062.2022.00623","DOIUrl":"https://doi.org/10.1556/2062.2022.00623","url":null,"abstract":"","PeriodicalId":37594,"journal":{"name":"Acta Linguistica Academica","volume":" ","pages":""},"PeriodicalIF":0.5,"publicationDate":"2022-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44038232","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A proof-of-concept meaning discrimination experiment to compile a word-in-context dataset for adjectives – A graph-based distributional approach
IF 0.5 · Tier 3 (Literature) · LANGUAGE & LINGUISTICS · Pub Date: 2022-12-12 · DOI: 10.1556/2062.2022.00579
Enikő Héja, Noémi Ligeti-Nagy
The Word-in-Context corpus, which forms part of the SuperGLUE benchmark dataset, focuses on a specific sense disambiguation task: deciding whether two occurrences of a given target word in two different contexts convey the same meaning or not. Unfortunately, the WiC database exhibits relatively low inter-annotator agreement, which implies that the meaning discrimination task is not well defined even for humans. The present paper aims to tackle this problem by anchoring semantic information to observable surface data. To do so, we experimented with a graph-based distributional approach, where both sparse and dense adjectival vector representations served as input. In line with our expectations, the algorithm is able to anchor the semantic information to contextual data, and it can therefore provide clear and explicit criteria as to when the same meaning should be assigned to two occurrences. Moreover, since the method does not rely on any external knowledge base, it should be suitable for any low- or medium-resourced language.
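As a concrete illustration of the task itself, here is a minimal sketch of the WiC-style decision: embed the target word in each context with a pretrained encoder and threshold the cosine similarity of the two contextual vectors. This is not the authors' graph-based distributional method; the checkpoint name, threshold, and helper functions are illustrative assumptions.

```python
# A minimal sketch of the WiC-style decision, NOT the paper's graph-based method:
# embed the target word in each context and threshold the cosine similarity.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "bert-base-uncased"  # placeholder encoder
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)

def target_vector(sentence: str, target: str) -> torch.Tensor:
    """Mean-pool the hidden states of the subword span belonging to `target`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]           # (seq_len, dim)
    span = tokenizer(target, add_special_tokens=False)["input_ids"]
    ids = enc["input_ids"][0].tolist()
    for i in range(len(ids) - len(span) + 1):                # locate the target span
        if ids[i:i + len(span)] == span:
            return hidden[i:i + len(span)].mean(dim=0)
    raise ValueError(f"target {target!r} not found in sentence")

def same_meaning(s1: str, s2: str, target: str, threshold: float = 0.6) -> bool:
    v1, v2 = target_vector(s1, target), target_vector(s2, target)
    return torch.cosine_similarity(v1, v2, dim=0).item() >= threshold

print(same_meaning("The room was cold.", "Her reply was cold.", "cold"))
```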
Citations: 0
BiVaSE: A bilingual variational sentence encoder with randomly initialized Transformer layers
IF 0.5 · Tier 3 (Literature) · LANGUAGE & LINGUISTICS · Pub Date: 2022-12-12 · DOI: 10.1556/2062.2022.00584
Bence Nyéki
Transformer-based NLP models have achieved state-of-the-art results in many NLP tasks, including text classification and text generation. However, the layers of these models do not output any explicit representations for text units larger than tokens (e.g. sentences), although such representations are required to perform text classification. Sentence encodings are usually obtained by applying a pooling technique during fine-tuning on a specific task. In this paper, a new sentence encoder is introduced. Built on an autoencoder architecture, it learns sentence representations from the very beginning of its training. The model was trained on bilingual data with variational Bayesian inference. The sentence representations were evaluated on downstream and linguistic probing tasks. Although the newly introduced encoder generally performs worse than well-known Transformer-based encoders, the experiments show that it learned to incorporate linguistic information into its sentence representations.
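The variational bottleneck described above can be sketched as follows: pooled sentence states are mapped to a mean and a log-variance, a latent sentence vector is sampled via the reparameterization trick, and a KL term regularizes it toward a standard normal. The dimensions and module names are illustrative assumptions, not BiVaSE's actual architecture.

```python
# A minimal sketch of a variational sentence bottleneck with the
# reparameterization trick; an assumption-laden toy, not BiVaSE itself.
import torch
import torch.nn as nn

class VariationalSentenceEncoder(nn.Module):
    def __init__(self, hidden_dim: int = 768, latent_dim: int = 256):
        super().__init__()
        self.to_mu = nn.Linear(hidden_dim, latent_dim)      # mean of q(z|x)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)  # log-variance of q(z|x)

    def forward(self, pooled: torch.Tensor):
        mu, logvar = self.to_mu(pooled), self.to_logvar(pooled)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # z ~ q(z|x)
        # KL(q(z|x) || N(0, I)), added to the reconstruction loss during training
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1).mean()
        return z, kl

encoder = VariationalSentenceEncoder()
pooled_states = torch.randn(8, 768)   # stand-in for pooled Transformer outputs
z, kl = encoder(pooled_states)
print(z.shape, kl.item())
```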
Citations: 0
Neural machine translation for Hungarian
IF 0.5 · Tier 3 (Literature) · LANGUAGE & LINGUISTICS · Pub Date: 2022-11-30 · DOI: 10.1556/2062.2022.00576
L. Laki, Zijian Győző Yang
In this research, we give an overview of currently existing solutions for machine translation and assess their performance on the English-Hungarian language pair. Hungarian is considered a challenging language for machine translation because its grammatical structure and word order differ substantially from English. We probed various machine translation systems from both academic and industrial applications. One key highlight of our work is that our models (Marian NMT, BART) performed significantly better than the solutions offered by most of the market-leading multinational companies. Finally, we fine-tuned several pre-trained multilingual models (mT5, mBART, M2M100) for English-Hungarian translation, which achieved state-of-the-art results on our test corpora.
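For orientation, here is a hedged sketch of English-to-Hungarian translation with one of the pre-trained models named above (M2M100) via the Hugging Face transformers API. This uses the off-the-shelf checkpoint, not the fine-tuned systems evaluated in the paper.

```python
# Translate English -> Hungarian with the off-the-shelf M2M100 checkpoint.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

tokenizer.src_lang = "en"
encoded = tokenizer("Hungarian is a morphologically rich language.",
                    return_tensors="pt")
generated = model.generate(
    **encoded,
    forced_bos_token_id=tokenizer.get_lang_id("hu"),  # force Hungarian output
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```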
Citations: 1
Neural text summarization for Hungarian
IF 0.5 · Tier 3 (Literature) · LANGUAGE & LINGUISTICS · Pub Date: 2022-11-29 · DOI: 10.1556/2062.2022.00577
Zijian Győző Yang
One of the most important NLP tasks for industry today is producing a summary of longer text documents. The task is among the hottest research topics, and several solutions have been created for English. There are two types of text summarization, extractive and abstractive: the goal of the first is to find the relevant sentences in the text, while the second generates a summary based on the original text. In this research I built the first Hungarian text summarization systems for both the extractive and the abstractive subtask. Different neural Transformer-based methods were used and evaluated. This publication presents the first Hungarian abstractive summarization tool, based on mBART and mT5 models, which achieved state-of-the-art results.
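To make the extractive/abstractive distinction concrete, here is a minimal sketch of the extractive variant: score sentences by the frequency of their words in the document and keep the top-k in original order. This is a naive frequency baseline, not the neural systems built in the paper.

```python
# A naive extractive-summarization baseline: frequency-scored sentence selection.
import re
from collections import Counter

def extractive_summary(text: str, k: int = 2) -> str:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))   # document word frequencies

    def score(sent: str) -> float:
        tokens = re.findall(r"\w+", sent.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    top = sorted(sentences, key=score, reverse=True)[:k]
    return " ".join(s for s in sentences if s in top)  # keep original order

doc = ("Hungarian is spoken by about thirteen million people. "
       "Summarization systems compress long documents. "
       "Extractive systems select sentences, abstractive systems generate new text.")
print(extractive_summary(doc, k=2))
```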
Citations: 0
Cross-lingual transfer of knowledge in distributional language models: Experiments in Hungarian
IF 0.5 · Tier 3 (Literature) · LANGUAGE & LINGUISTICS · Pub Date: 2022-11-22 · DOI: 10.1556/2062.2022.00580
Attila Novák, Borbála Novák
In this paper, we argue that the very convincing performance of recent deep-neural-model-based NLP applications has demonstrated that the distributionalist approach to language description is more successful than the earlier subtle rule-based models created by the generative school. The now ubiquitous neural models can naturally handle ambiguity and achieve human-like linguistic performance, even though most of their training consists only of noisy raw linguistic data without any multimodal grounding or external supervision. This refutes Chomsky's argument that no generic neural architecture can arrive at the linguistic performance exhibited by humans given the limited input available to children. In addition, we demonstrate in experiments with Hungarian as the target language that the shared internal representations in multilingually trained versions of these models enable them to transfer specific linguistic skills, including structured annotation skills, from one language to another remarkably efficiently.
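The transfer recipe the abstract describes can be sketched as: fine-tune a multilingual encoder on an annotation task in one language, then apply it unchanged to Hungarian input. The checkpoint path below is a hypothetical placeholder; the paper's own models and tasks differ.

```python
# A hedged sketch of zero-shot cross-lingual transfer of an annotation skill.
from transformers import pipeline

# Assume `model_dir` holds a multilingual token classifier (e.g., XLM-R)
# fine-tuned on *English* NER data only -- a hypothetical local checkpoint.
model_dir = "path/to/xlmr-finetuned-on-english-ner"
tagger = pipeline("token-classification", model=model_dir,
                  aggregation_strategy="simple")

# Zero-shot application to Hungarian: the shared multilingual representation
# carries the annotation skill across languages.
print(tagger("Kovács János Budapesten dolgozik a Magyar Tudományos Akadémián."))
```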
Citations: 0
Winograd schemata and other datasets for anaphora resolution in Hungarian
IF 0.5 · Tier 3 (Literature) · LANGUAGE & LINGUISTICS · Pub Date: 2022-11-22 · DOI: 10.1556/2062.2022.00575
Noémi Vadász, Noémi Ligeti-Nagy
The Winograd Schema Challenge (WSC, proposed by Levesque, Davis & Morgenstern 2012) is considered a novel Turing Test for examining machine intelligence. Winograd schema questions require resolving anaphora with the help of world knowledge and commonsense reasoning. Anaphora resolution is itself an important and difficult issue in natural language processing; therefore, many other datasets have been created to address it. In this paper we look into the Winograd schemata, other Winograd-like datasets, and translations of the schemata into other languages, such as Chinese, French and Portuguese. We present the Hungarian translation of the original Winograd schemata and a parallel corpus of all currently available translations of the schemata. We also adapted several other anaphora resolution datasets to Hungarian, and we discuss the challenges we faced during the translation/adaptation process.
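For illustration, here is a minimal sketch of how a Winograd schema item can be represented and resolved: substitute each candidate antecedent for the pronoun and pick the more plausible sentence. The field names and the stand-in scorer are illustrative assumptions; a real system would score the two readings with a language model.

```python
# A minimal Winograd-schema data structure and resolution loop.
from dataclasses import dataclass

@dataclass
class WinogradItem:
    sentence: str      # contains the placeholder "_" for the pronoun
    candidates: tuple  # the two possible antecedents
    answer: int        # index of the correct candidate

def resolve(item: WinogradItem, plausibility) -> int:
    """Return the index of the candidate whose substitution scores highest."""
    filled = [item.sentence.replace("_", c) for c in item.candidates]
    scores = [plausibility(s) for s in filled]
    return max(range(len(scores)), key=scores.__getitem__)

item = WinogradItem(
    sentence="The trophy doesn't fit into the suitcase because _ is too large.",
    candidates=("the trophy", "the suitcase"),
    answer=0,
)
# Stand-in scorer only; a real system would use an LM log-probability here.
toy_score = lambda s: -len(s)
print(item.candidates[resolve(item, toy_score)])
```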
Citations: 3
Principles of corpus querying: A discussion note
IF 0.5 · Tier 3 (Literature) · LANGUAGE & LINGUISTICS · Pub Date: 2022-11-22 · DOI: 10.1556/2062.2022.00581
Bálint Sass
Nowadays, it is quite common in linguistics to base research on data instead of introspection. Countless corpora, both raw and linguistically annotated, are available to us and provide the essential data needed. Corpora are large in most cases, ranging from several million to several billion words in size, and are clearly not suited to word-by-word investigation through close reading. There are basically two ways to retrieve data from them: (1) through a query interface, or (2) directly, by automatic text processing. Here we present principles for soundly and effectively collecting linguistic data from corpora by querying, i.e. without the programming knowledge required to manipulate the data directly: what is worth thinking about, which tools to use, what to do by default, and how to handle problematic cases. In sum: how to obtain correct and complete data from corpora for linguistic research.
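As a concrete example of the second route (direct automatic text processing), here is a minimal keyword-in-context (KWIC) concordance over a raw corpus; real query interfaces (e.g., CQL-based engines) operate over linguistically annotated tokens instead.

```python
# A minimal KWIC concordance: show each hit with a window of surrounding words.
import re

def kwic(corpus: str, word: str, window: int = 3):
    tokens = corpus.split()
    for i, tok in enumerate(tokens):
        if re.fullmatch(word, tok, flags=re.IGNORECASE):
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            yield f"{left:>30} [{tok}] {right}"

corpus = "the cat sat on the mat and the cat slept on the chair"
for line in kwic(corpus, "cat"):
    print(line)
```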
Citations: 1
PrevDistro: An open-access dataset of Hungarian preverb constructions
IF 0.5 · Tier 3 (Literature) · LANGUAGE & LINGUISTICS · Pub Date: 2022-11-22 · DOI: 10.1556/2062.2022.00578
Ágnes Kalivoda
Hungarian has a prolific system of complex predicate formation combining a separable preverb and a verb. These combinations can enter a wide range of constructions, with the preverb preserving its separability to some extent, depending on the construction in question. The primary concern of this paper is to advance the investigation of these phenomena by presenting PrevDistro (Preverb Distributions), an open-access dataset containing more than 41.5 million corpus occurrences of 49 preverb construction types. The paper gives a detailed introduction to PrevDistro, including design considerations, methodology and the resulting dataset's main characteristics.
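A hedged sketch of how such a dataset might be explored once downloaded follows; the file name and column names are illustrative assumptions, not PrevDistro's documented schema.

```python
# Filter a hypothetical local copy of the dataset for one preverb and
# aggregate its occurrence counts by construction type.
import pandas as pd

df = pd.read_csv("prevdistro.tsv", sep="\t")      # hypothetical local file
meg = df[df["preverb"] == "meg"]                  # one preverb as an example
by_construction = meg.groupby("construction_type")["frequency"].sum()
print(by_construction.sort_values(ascending=False).head(10))
```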
Citations: 0
Morphology aware data augmentation with neural language models for online hybrid ASR
IF 0.5 · Tier 3 (Literature) · LANGUAGE & LINGUISTICS · Pub Date: 2022-11-21 · DOI: 10.1556/2062.2022.00582
Balázs Tarján, T. Fegyó, P. Mihajlik
Recognition of Hungarian conversational telephone speech is challenging due to the informal style and morphological richness of the language. Neural Network Language Models (NNLMs) can provide a remedy for the high perplexity of the task; however, their high complexity makes them very difficult to apply in the first (single) pass of an online system. Recent studies have shown that a considerable part of the knowledge of NNLMs can be transferred to traditional n-grams by using data augmentation based on neural text generation. Data augmentation with NNLMs works well for isolating languages; however, we show that it causes a vocabulary explosion in a morphologically rich language. Therefore, we propose a new, morphology-aware neural text augmentation method, in which we retokenize the generated text into statistically derived subwords. We compare the performance of word-based and subword-based data augmentation techniques with recurrent and Transformer language models and show that subword-based methods can significantly improve the Word Error Rate (WER) while greatly reducing vocabulary size and memory requirements. Combining subword-based modeling and neural language model-based data augmentation, we achieved an 11% relative WER reduction while preserving real-time operation of our conversational telephone speech recognition system. Finally, we also demonstrate that subword-based neural text augmentation outperforms the word-based approach not only in overall WER but also in the recognition of Out-of-Vocabulary (OOV) words.
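The retokenization step described above can be sketched with the sentencepiece library: train a statistically derived subword model on the NNLM-generated text, then re-segment that text before building the n-gram model. Vocabulary size, model type, and file names are illustrative assumptions, not the paper's settings.

```python
# Train a subword model on generated text and re-segment a line with it.
import sentencepiece as spm

# Train a unigram subword model on the NNLM-generated augmentation text.
spm.SentencePieceTrainer.train(
    input="generated_text.txt",    # hypothetical file of NNLM-generated sentences
    model_prefix="subword",
    vocab_size=8000,
    model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="subword.model")
line = "megérkeztünk a pályaudvarra"   # example Hungarian sentence
pieces = sp.encode(line, out_type=str)
print(" ".join(pieces))   # subword-segmented line for n-gram training
```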
Citations: 0