"A proof-of-concept meaning discrimination experiment to compile a word-in-context dataset for adjectives – A graph-based distributional approach" (Enikő Héja & Noémi Ligeti-Nagy). Acta Linguistica Academica, 2022-12-12. doi:10.1556/2062.2022.00579

The Word-in-Context (WiC) corpus, which forms part of the SuperGLUE benchmark dataset, focuses on a specific sense disambiguation task: deciding whether two occurrences of a given target word in two different contexts convey the same meaning or not. Unfortunately, the WiC dataset exhibits relatively low inter-annotator agreement, which implies that the meaning discrimination task is not well defined even for humans. The present paper aims to tackle this problem by anchoring semantic information to observable surface data. To do so, we experimented with a graph-based distributional approach in which both sparse and dense adjectival vector representations served as input. In line with our expectations, the algorithm is able to anchor semantic information to contextual data, and it can therefore provide clear and explicit criteria for when the same meaning should be assigned to two occurrences. Moreover, since the method does not rely on any external knowledge base, it should be suitable for any low- or medium-resourced language.
"BiVaSE: A bilingual variational sentence encoder with randomly initialized Transformer layers" (Bence Nyéki). Acta Linguistica Academica, 2022-12-12. doi:10.1556/2062.2022.00584

Transformer-based NLP models have achieved state-of-the-art results in many NLP tasks, including text classification and text generation. However, the layers of these models do not output explicit representations for text units larger than tokens (e.g. sentences), although such representations are required for text classification. Sentence encodings are usually obtained by applying a pooling technique during fine-tuning on a specific task. In this paper, a new sentence encoder is introduced. Built on an autoencoder architecture, it is trained to learn sentence representations from the very beginning of its training. The model was trained on bilingual data with variational Bayesian inference. The sentence representations were evaluated in downstream and linguistic probing tasks. Although the newly introduced encoder generally performs worse than well-known Transformer-based encoders, the experiments show that it learned to incorporate linguistic information into its sentence representations.
"Neural machine translation for Hungarian" (L. Laki & Zijian Győző Yang). Acta Linguistica Academica, 2022-11-30. doi:10.1556/2062.2022.00576

In this research, we give an overview of currently existing solutions for machine translation and assess their performance on the English-Hungarian language pair. Hungarian is considered a challenging language for machine translation because its grammatical structure and word order differ greatly from English. We probed various machine translation systems from both academic and industrial applications. One key highlight of our work is that our models (Marian NMT, BART) performed significantly better than the solutions offered by most of the market-leading multinational companies. Finally, we fine-tuned several pretrained multilingual models (mT5, mBART, M2M100) for English-Hungarian translation, achieving state-of-the-art results on our test corpora.
"Neural text summarization for Hungarian" (Zijian Győző Yang). Acta Linguistica Academica, 2022-11-29. doi:10.1556/2062.2022.00577

One of the most important NLP tasks for industry today is producing a summary of longer text documents. The task is a hot research topic, and several solutions already exist for English. Text summarization comes in two variants: extractive and abstractive. The goal of the first is to find the relevant sentences in the text, while the second generates a summary based on the original text. In this research, I built the first Hungarian text summarization systems for both the extractive and the abstractive subtask. Different neural transformer-based methods were used and evaluated. This publication presents the first Hungarian abstractive summarization tools, based on mBART and mT5 models, which achieved state-of-the-art results.
"Cross-lingual transfer of knowledge in distributional language models: Experiments in Hungarian" (Attila Novák & Borbála Novák). Acta Linguistica Academica, 2022-11-22. doi:10.1556/2062.2022.00580

In this paper, we argue that the convincing performance of recent deep-neural-model-based NLP applications demonstrates that the distributionalist approach to language description has proven more successful than the earlier subtle rule-based models created by the generative school. The now ubiquitous neural models handle ambiguity naturally and achieve human-like linguistic performance, with most of their training consisting only of noisy raw linguistic data without any multimodal grounding or external supervision. This refutes Chomsky's argument that no generic neural architecture can arrive at the linguistic performance exhibited by humans given the limited input available to children. In addition, we demonstrate in experiments with Hungarian as the target language that the shared internal representations in multilingually trained versions of these models enable them to transfer specific linguistic skills, including structured annotation skills, from one language to another remarkably efficiently.
"Winograd schemata and other datasets for anaphora resolution in Hungarian" (Noémi Vadász & Noémi Ligeti-Nagy). Acta Linguistica Academica, 2022-11-22. doi:10.1556/2062.2022.00575

The Winograd Schema Challenge (WSC, proposed by Levesque, Davis & Morgenstern 2012) is considered a novel Turing Test for examining machine intelligence. Winograd schema questions require the resolution of anaphora with the help of world knowledge and commonsense reasoning. Anaphora resolution is itself an important and difficult issue in natural language processing, and many other datasets have been created to address it. In this paper, we look into the Winograd schemata, other Winograd-like datasets, and the translations of the schemata into other languages, such as Chinese, French and Portuguese. We present the Hungarian translation of the original Winograd schemata and a parallel corpus of all currently available translations of the schemata. We also adapted some other anaphora resolution datasets to Hungarian, and we discuss the challenges we faced during the translation/adaptation process.
"Principles of corpus querying: A discussion note" (Bálint Sass). Acta Linguistica Academica, 2022-11-22. doi:10.1556/2062.2022.00581

Nowadays it is quite common in linguistics to base research on data instead of introspection. Countless corpora – both raw and linguistically annotated – are available to us and provide the essential data needed. Corpora are large in most cases, ranging from several million to several billion words in size, clearly not suitable for word-by-word investigation through close reading. There are basically two ways to retrieve data from them: (1) through a query interface or (2) directly, by automatic text processing. Here we present principles for soundly and effectively collecting linguistic data from corpora by querying, i.e. without the programming knowledge needed to manipulate the data directly: what is worth thinking about, which tools to use, what to do by default, and how to solve problematic cases. In sum, how to obtain correct and complete data from corpora for linguistic research.
"PrevDistro: An open-access dataset of Hungarian preverb constructions" (Ágnes Kalivoda). Acta Linguistica Academica, 2022-11-22. doi:10.1556/2062.2022.00578

Hungarian has a prolific system of complex predicate formation combining a separable preverb and a verb. These combinations can enter a wide range of constructions, with the preverb preserving its separability to some extent, depending on the construction in question. The primary concern of this paper is to advance the investigation of these phenomena by presenting PrevDistro (Preverb Distributions), an open-access dataset containing more than 41.5 million corpus occurrences of 49 preverb construction types. The paper gives a detailed introduction to PrevDistro, including design considerations, methodology and the resulting dataset's main characteristics.
"Morphology aware data augmentation with neural language models for online hybrid ASR" (Balázs Tarján, T. Fegyó & P. Mihajlik). Acta Linguistica Academica, 2022-11-21. doi:10.1556/2062.2022.00582

Recognition of Hungarian conversational telephone speech is challenging due to the informal style and morphological richness of the language. Neural Network Language Models (NNLMs) can remedy the high perplexity of the task; however, their complexity makes them very difficult to apply in the first (single) pass of an online system. Recent studies have shown that a considerable part of the knowledge in NNLMs can be transferred to traditional n-grams through data augmentation based on neural text generation. Data augmentation with NNLMs works well for isolating languages; however, we show that it causes a vocabulary explosion in a morphologically rich language. We therefore propose a new, morphology-aware neural text augmentation method in which the generated text is retokenized into statistically derived subwords. We compare the performance of word-based and subword-based data augmentation techniques with recurrent and Transformer language models, and show that subword-based methods can significantly improve the Word Error Rate (WER) while greatly reducing vocabulary size and memory requirements. Combining subword-based modeling and neural language model-based data augmentation, we achieved an 11% relative WER reduction while preserving real-time operation of our conversational telephone speech recognition system. Finally, we demonstrate that subword-based neural text augmentation outperforms the word-based approach not only in overall WER but also in the recognition of Out-of-Vocabulary (OOV) words.