
Latest publications in Natural Language Engineering

Comparison of text preprocessing methods
IF 2.5 | CAS Tier 3 (Computer Science) | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2022-06-13 | DOI: 10.1017/S1351324922000213
Christine P. Chai
Abstract: Text preprocessing is not only an essential step to prepare the corpus for modeling but also a key area that directly affects the natural language processing (NLP) application results. For instance, precise tokenization increases the accuracy of part-of-speech (POS) tagging, and retaining multiword expressions improves reasoning and machine translation. The text corpus needs to be appropriately preprocessed before it is ready to serve as the input to computer models. The preprocessing requirements depend on both the nature of the corpus and the NLP application itself, that is, what researchers would like to achieve from analyzing the data. Conventional text preprocessing practices generally suffice, but there exist situations where the text preprocessing needs to be customized for better analysis results. Hence, we discuss the pros and cons of several common text preprocessing methods: removing formatting, tokenization, text normalization, handling punctuation, removing stopwords, stemming and lemmatization, n-gramming, and identifying multiword expressions. Then, we provide examples of text datasets which require special preprocessing and how previous researchers handled the challenge. We expect this article to be a starting guideline on how to select and fine-tune text preprocessing methods.
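As a concrete illustration of the steps the article compares, here is a minimal preprocessing pipeline in Python using NLTK; the library choice and the sample sentence are illustrative assumptions, not drawn from the article itself.

```python
# pip install nltk
import nltk

# One-time resource downloads (newer NLTK releases may also need "punkt_tab").
for pkg in ("punkt", "stopwords", "wordnet"):
    nltk.download(pkg, quiet=True)

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = "The cats were sitting on the mats, watching the birds."

tokens = nltk.word_tokenize(text.lower())          # tokenization
words = [t for t in tokens if t.isalpha()]         # punctuation handling
stops = set(stopwords.words("english"))
content = [w for w in words if w not in stops]     # stopword removal

# Stemming (crude suffix stripping) vs. lemmatization (dictionary lookup)
stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
print([stemmer.stem(w) for w in content])          # ['cat', 'sit', 'mat', 'watch', 'bird']
print([lemmatizer.lemmatize(w) for w in content])  # ['cat', 'sitting', 'mat', 'watching', 'bird']

print(list(nltk.bigrams(content)))                 # n-gramming (n = 2)
```

Whether to lowercase, strip punctuation, or drop stopwords is exactly the kind of application-dependent choice the article weighs; stopword removal helps some tasks but can hurt phrase-sensitive applications such as machine translation.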
{"title":"Comparison of text preprocessing methods","authors":"Christine P. Chai","doi":"10.1017/S1351324922000213","DOIUrl":"https://doi.org/10.1017/S1351324922000213","url":null,"abstract":"Abstract Text preprocessing is not only an essential step to prepare the corpus for modeling but also a key area that directly affects the natural language processing (NLP) application results. For instance, precise tokenization increases the accuracy of part-of-speech (POS) tagging, and retaining multiword expressions improves reasoning and machine translation. The text corpus needs to be appropriately preprocessed before it is ready to serve as the input to computer models. The preprocessing requirements depend on both the nature of the corpus and the NLP application itself, that is, what researchers would like to achieve from analyzing the data. Conventional text preprocessing practices generally suffice, but there exist situations where the text preprocessing needs to be customized for better analysis results. Hence, we discuss the pros and cons of several common text preprocessing methods: removing formatting, tokenization, text normalization, handling punctuation, removing stopwords, stemming and lemmatization, n-gramming, and identifying multiword expressions. Then, we provide examples of text datasets which require special preprocessing and how previous researchers handled the challenge. We expect this article to be a starting guideline on how to select and fine-tune text preprocessing methods.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":"29 1","pages":"509 - 553"},"PeriodicalIF":2.5,"publicationDate":"2022-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49277409","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
Ad astra or astray: Exploring linguistic knowledge of multilingual BERT through NLI task
IF 2.5 | CAS Tier 3 (Computer Science) | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2022-06-09 | DOI: 10.1017/S1351324922000225
M. Tikhonova, V. Mikhailov, D. Pisarevskaya, Valentin Malykh, Tatiana Shavrina
Abstract: Recent research has reported that standard fine-tuning approaches can be unstable due to being prone to various sources of randomness, including but not limited to weight initialization, training data order, and hardware. Such brittleness can lead to different evaluation results, prediction confidences, and generalization inconsistency of the same models independently fine-tuned under the same experimental setup. Our paper explores this problem in natural language inference, a common task in benchmarking practices, and extends the ongoing research to the multilingual setting. We propose six novel textual entailment and broad-coverage diagnostic datasets for French, German, and Swedish. Our key findings are that the mBERT model demonstrates fine-tuning instability for categories that involve lexical semantics, logic, and predicate-argument structure and struggles to learn monotonicity, negation, numeracy, and symmetry. We also observe that using extra training data only in English can enhance the generalization performance and fine-tuning stability, which we attribute to the cross-lingual transfer capabilities. However, the ratio of particular features in the additional training data might rather hurt the performance for model instances. We are publicly releasing the datasets, hoping to foster the diagnostic investigation of language models (LMs) in a cross-lingual scenario, particularly in terms of benchmarking, which might promote a more holistic understanding of multilingualism in LMs and cross-lingual knowledge transfer.
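To make the studied phenomenon concrete, below is a minimal sketch of scoring an entailment pair with multilingual BERT while varying the random seed that governs fine-tuning; the checkpoint name, label count, and example pair are illustrative assumptions, not the authors' exact setup.

```python
# pip install transformers torch
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, set_seed

name = "bert-base-multilingual-cased"   # assumed mBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)

for seed in (13, 42, 1234):
    set_seed(seed)  # controls weight initialization and data order
    model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=3)
    # ... fine-tune on an NLI training set here; omitted for brevity ...
    inputs = tokenizer("Une femme chante.", "Quelqu'un fait de la musique.",
                       return_tensors="pt")
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(-1)
    print(seed, probs)  # run-to-run variance across seeds is the instability at issue
```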
{"title":"Ad astra or astray: Exploring linguistic knowledge of multilingual BERT through NLI task","authors":"M. Tikhonova, V. Mikhailov, D. Pisarevskaya, Valentin Malykh, Tatiana Shavrina","doi":"10.1017/S1351324922000225","DOIUrl":"https://doi.org/10.1017/S1351324922000225","url":null,"abstract":"Abstract Recent research has reported that standard fine-tuning approaches can be unstable due to being prone to various sources of randomness, including but not limited to weight initialization, training data order, and hardware. Such brittleness can lead to different evaluation results, prediction confidences, and generalization inconsistency of the same models independently fine-tuned under the same experimental setup. Our paper explores this problem in natural language inference, a common task in benchmarking practices, and extends the ongoing research to the multilingual setting. We propose six novel textual entailment and broad-coverage diagnostic datasets for French, German, and Swedish. Our key findings are that the mBERT model demonstrates fine-tuning instability for categories that involve lexical semantics, logic, and predicate-argument structure and struggles to learn monotonicity, negation, numeracy, and symmetry. We also observe that using extra training data only in English can enhance the generalization performance and fine-tuning stability, which we attribute to the cross-lingual transfer capabilities. However, the ratio of particular features in the additional training data might rather hurt the performance for model instances. We are publicly releasing the datasets, hoping to foster the diagnostic investigation of language models (LMs) in a cross-lingual scenario, particularly in terms of benchmarking, which might promote a more holistic understanding of multilingualism in LMs and cross-lingual knowledge transfer.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":"29 1","pages":"554 - 583"},"PeriodicalIF":2.5,"publicationDate":"2022-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47433843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
Emerging trends: General fine-tuning (gft)
IF 2.5 | CAS Tier 3 (Computer Science) | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2022-05-23 | DOI: 10.1017/S1351324922000237
Kenneth Ward Church, Xingyu Cai, Yibiao Ying, Zeyu Chen, Guangxu Xun, Yuchen Bian
Abstract: This paper describes gft (general fine-tuning), a little language for deep nets, introduced at an ACL-2022 tutorial. gft makes deep nets accessible to a broad audience including non-programmers. It is standard practice in many fields to use statistics packages such as R. One should not need to know how to program in order to fit a regression or classification model and to use the model to make predictions for novel inputs. With gft, fine-tuning and inference are similar to fit and predict in regression and classification. gft demystifies deep nets; no one would suggest that regression-like methods are “intelligent.”
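gft's own syntax is not reproduced here; to ground the analogy the abstract draws, this is the scikit-learn fit/predict pattern that gft is said to mirror for deep nets (the toy data is invented for illustration).

```python
# pip install scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["great movie", "loved it", "terrible film", "waste of time"]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)                     # analogous to fine-tuning in gft
print(clf.predict(["what a great film"]))  # analogous to inference in gft
```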
{"title":"Emerging trends: General fine-tuning (gft)","authors":"Kenneth Ward Church, Xingyu Cai, Yibiao Ying, Zeyu Chen, Guangxu Xun, Yuchen Bian","doi":"10.1017/S1351324922000237","DOIUrl":"https://doi.org/10.1017/S1351324922000237","url":null,"abstract":"Abstract This paper describes gft (general fine-tuning), a little language for deep nets, introduced at an ACL-2022 tutorial. gft makes deep nets accessible to a broad audience including non-programmers. It is standard practice in many fields to use statistics packages such as R. One should not need to know how to program in order to fit a regression or classification model and to use the model to make predictions for novel inputs. With gft, fine-tuning and inference are similar to fit and predict in regression and classification. gft demystifies deep nets; no one would suggest that regression-like methods are “intelligent.”","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":"28 1","pages":"519 - 535"},"PeriodicalIF":2.5,"publicationDate":"2022-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45341971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Turkish abstractive text summarization using pretrained sequence-to-sequence models
IF 2.5 | CAS Tier 3 (Computer Science) | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2022-05-13 | DOI: 10.1017/S1351324922000195
Batuhan Baykara, Tunga Güngör
Abstract: The tremendous increase in the number of documents available on the Web has turned finding the relevant piece of information into a challenging, tedious, and time-consuming activity. Accordingly, automatic text summarization has become an important field of study, gaining significant attention from researchers. Lately, with the advances in deep learning, neural abstractive text summarization with sequence-to-sequence (Seq2Seq) models has gained popularity. There have been many improvements in these models, such as the use of pretrained language models (e.g., GPT, BERT, and XLM) and pretrained Seq2Seq models (e.g., BART and T5). These improvements have addressed certain shortcomings in neural summarization and have improved upon challenges such as saliency, fluency, and semantics, enabling the generation of higher quality summaries. Unfortunately, these research attempts were mostly limited to the English language. Monolingual BERT models and multilingual pretrained Seq2Seq models have been released recently, providing the opportunity to utilize such state-of-the-art models in low-resource languages such as Turkish. In this study, we make use of pretrained Seq2Seq models and obtain state-of-the-art results on the two large-scale Turkish datasets, TR-News and MLSum, for the text summarization task. Then, we utilize the title information in the datasets and establish hard baselines for the title generation task on both datasets. We show that the input to the models is of substantial importance for the success of such tasks. Additionally, we provide extensive analysis of the models, including cross-dataset evaluations, various text generation options, and the effect of preprocessing in ROUGE evaluations for Turkish. It is shown that the monolingual BERT models outperform the multilingual BERT models on all tasks across all the datasets. Lastly, qualitative evaluations of the generated summaries and titles of the models are provided.
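As a hedged sketch of the summarization interface such pretrained Seq2Seq models expose, the snippet below uses a multilingual T5 checkpoint from the Hugging Face hub; the checkpoint name is an assumption, and the authors' own Turkish fine-tuned weights are not referenced.

```python
# pip install transformers sentencepiece torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Assumed multilingual checkpoint; fine-tune on TR-News/MLSum before real use.
name = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

article = "..."  # a Turkish news article
inputs = tokenizer(article, return_tensors="pt", truncation=True, max_length=512)
summary_ids = model.generate(**inputs, num_beams=4, max_new_tokens=64)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```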
{"title":"Turkish abstractive text summarization using pretrained sequence-to-sequence models","authors":"Batuhan Baykara, Tunga Güngör","doi":"10.1017/S1351324922000195","DOIUrl":"https://doi.org/10.1017/S1351324922000195","url":null,"abstract":"Abstract The tremendous amount of increase in the number of documents available on the Web has turned finding the relevant piece of information into a challenging, tedious, and time-consuming activity. Accordingly, automatic text summarization has become an important field of study by gaining significant attention from the researchers. Lately, with the advances in deep learning, neural abstractive text summarization with sequence-to-sequence (Seq2Seq) models has gained popularity. There have been many improvements in these models such as the use of pretrained language models (e.g., GPT, BERT, and XLM) and pretrained Seq2Seq models (e.g., BART and T5). These improvements have addressed certain shortcomings in neural summarization and have improved upon challenges such as saliency, fluency, and semantics which enable generating higher quality summaries. Unfortunately, these research attempts were mostly limited to the English language. Monolingual BERT models and multilingual pretrained Seq2Seq models have been released recently providing the opportunity to utilize such state-of-the-art models in low-resource languages such as Turkish. In this study, we make use of pretrained Seq2Seq models and obtain state-of-the-art results on the two large-scale Turkish datasets, TR-News and MLSum, for the text summarization task. Then, we utilize the title information in the datasets and establish hard baselines for the title generation task on both datasets. We show that the input to the models has a substantial amount of importance for the success of such tasks. Additionally, we provide extensive analysis of the models including cross-dataset evaluations, various text generation options, and the effect of preprocessing in ROUGE evaluations for Turkish. It is shown that the monolingual BERT models outperform the multilingual BERT models on all tasks across all the datasets. Lastly, qualitative evaluations of the generated summaries and titles of the models are provided.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":"29 1","pages":"1275 - 1304"},"PeriodicalIF":2.5,"publicationDate":"2022-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46982077","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Natural Language Processing for Corpus Linguistics by Jonathan Dunn. Cambridge: Cambridge University Press, 2022. ISBN 9781009070447 (PB), ISBN 9781009070447 (OC), vi+88 pages.
IF 2.5 | CAS Tier 3 (Computer Science) | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2022-05-12 | DOI: 10.1017/S1351324922000201
J. Wen, Lan Yi
Corpus linguistics is essentially the computer-based empirical analysis that examines naturally occurring language and its use with a representative collection of machine-readable texts (Sinclair, 1991; Biber, Conrad and Reppen, 1998; McEnery and Hardie, 2012). The techniques of corpus linguistics enable the analysis of large amounts of corpus data from both qualitative (e.g., concordances) and quantitative (e.g., word frequencies) perspectives, which in turn may yield evidence for or against proposed linguistic statements or assumptions (Reppen, 2010). Despite its success in a wide range of fields (Römer, 2022), traditional corpus linguistics has become seemingly disconnected from recent technological advances in artificial intelligence as the computing power and corpus data available for linguistic analysis have continued to grow over the past decades. In this connection, more sophisticated methods are needed to update and expand the arsenal for corpus linguistics research. As its name suggests, this monograph focuses exclusively on utilizing NLP techniques to uncover different aspects of language use through the lens of corpus linguistics. It consists of four main chapters plus a brief conclusion. Each of the four main chapters highlights a different aspect of computational methodologies for corpus linguistic research, followed by a discussion of the potential ethical issues pertinent to the application. Five corpus-based case studies are presented to demonstrate how and why a particular computational method is used for linguistic analysis. Given the methodological orientation of the book, it is not surprising that there are substantial technical details concerning the implementation of these methods, which is usually a daunting task for readers without any background knowledge in computer programming. Fortunately, the author has made all the Python scripts and corpus data used in the case studies publicly available online at https://doi.org/10.24433/CO.3402613.v1. These online supporting materials are an invaluable complement to the book because they not only spare readers from coding but also make every result and graph in the book readily reproducible. To provide better hands-on experience for readers, a quick walkthrough of accessing the online materials is presented before the main chapters begin. With just a few clicks, readers will be able to run the code and replicate the case studies with interactive code notebooks. Of course, readers who are familiar with Python programming are encouraged to further explore the corpus data and expand the scripts to serve their own research purposes. Chapter 1 provides a general overview of computational analysis in corpus linguistics research and outlines the key issues to be addressed. It first defines the major problems (namely, categorization and comparison) in corpus analysis that NLP models can solve, and explains why computational linguistic analysis is needed for corpus linguistics research (namely, reproducibility and scalability). The author then introduces all five case studies to be presented in the following chapters. These studies, ranging from usage-based grammar to corpus-based sociolinguistics, demonstrate how NLP methods can be applied to the study of real-world linguistic phenomena. As for the key issues, the problems of categorization and comparison are discussed in two parts.
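The book's own Python scripts are available at the DOI cited above; as a flavor of the simplest quantitative analysis the review mentions (word frequencies), here is an independent minimal sketch that is not taken from those scripts.

```python
import re
from collections import Counter

corpus = [
    "Corpus linguistics examines naturally occurring language.",
    "Word frequencies and concordances are core corpus techniques.",
]

# Tokenize crudely and count word frequencies across the corpus.
tokens = [w for doc in corpus for w in re.findall(r"[a-z']+", doc.lower())]
print(Counter(tokens).most_common(5))
```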
{"title":"Natural Language Processing for Corpus Linguistics by Jonathan Dunn. Cambridge: Cambridge University Press, 2022. ISBN 9781009070447 (PB), ISBN 9781009070447 (OC), vi+88 pages.","authors":"J. Wen, Lan Yi","doi":"10.1017/S1351324922000201","DOIUrl":"https://doi.org/10.1017/S1351324922000201","url":null,"abstract":"Corpus linguistics is essentially the computer-based empirical analysis that examines naturally occurring language and its use with a representative collection of machine-readable texts (Sinclair, 1991; Biber, Conrad and Reppen, 1998; McEnery and Hardie, 2012). The techniques of corpus linguistics enable the analyzing of large amounts of corpus data from both qualitative (e.g., concordances) and quantitative (e.g., word frequencies) perspectives, which in turn may yield evidence for or against the proposed linguistic statements or assumptions (Reppen, 2010). Despite its success in a wide range of fields (Römer, 2022), traditional corpus linguistics has become seemingly disconnected from recent technological advances in artificial intelligence as the computing power and corpus data available for linguistic analysis continue to grow in the past decades. In this connection, more sophisticated methods are needed to update and expand the arsenal for corpus linguistics research. As its name suggests, this monograph focuses exclusively on utilizing NLP techniques to uncover different aspects of language use through the lens of corpus linguistics. It consists of four main chapters plus a brief conclusion. Each of the four main chapters highlights a different aspect of computational methodologies for corpus linguistic research, followed by a discussion on the potential ethical issues that are pertinent to the application. Five corpus-based case studies are presented to demonstrate how and why a particular computational method is used for linguistic analysis. Given the methodological orientation of the book, it is not surprising that there are substantial technical details concerning the implementation of these methods, which is usually a daunting task for those readers without any background knowledge in computer programming. Fortunately, the author has made all the Python scripts and corpus data used in the case studies publicly available online at https://doi.org/10.24433/CO.3402613.v1. These online supporting materials are an invaluable complement to the book because they not only ease readers from coding but also make every result and graph in the book readily reproducible. To provide better hands-on experience for readers, a quick walkthrough on the accessing of online materials is presented prior to the beginning of the main chapters. With just a few clicks, readers will be able to run the code and replicate the case studies with interactive code notebooks. Of course, readers who are familiar with Python programming are encouraged to further explore the corpus data and expand the scripts to serve their own research purposes. Chapter 1 provides a general overview of the computational analysis in corpus linguistics research and outlines the key issues to be addressed. 
It first defines the major problems (namely, categorization and comparison) in corpus analysis that NLP models can solve, and explains why computational linguistic analysis is needed for","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":"29 1","pages":"842 - 845"},"PeriodicalIF":2.5,"publicationDate":"2022-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42332455","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Abstract meaning representation of Turkish
IF 2.5 | CAS Tier 3 (Computer Science) | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2022-04-28 | DOI: 10.1017/s1351324922000183
Elif Oral, Ali Acar, Gülşen Eryiğit
Abstract meaning representation (AMR) is a graph-based sentence-level meaning representation that has become highly popular in recent years. AMR is a knowledge-based meaning representation that relies heavily on frame semantics for linking predicate frames and on entity knowledge bases such as DBpedia for linking named entity concepts. Although it was originally designed for English, its adaptation to non-English languages is possible by defining language-specific divergences and representations. This article introduces the first AMR representation framework for Turkish, which poses diverse challenges for AMR due to its typological differences from English: it is agglutinative, allows free constituent order, and is morphologically very rich, resulting in fewer surface word forms in sentences. The solutions introduced for these peculiarities are expected to guide studies on other similar languages and to speed up the construction of a cross-lingual universal AMR framework. Besides this main contribution, the article also presents the construction of the first Turkish AMR corpus of 700 sentences, the first Turkish AMR parser (a tree-to-graph rule-based parser) used for semi-automatic annotation, and an evaluation of the introduced resources.
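For readers new to the formalism, an AMR is a rooted, directed, labeled graph, conventionally written in PENMAN notation. The snippet below decodes the stock English example with the third-party penman library; it is purely illustrative and does not quote the Turkish corpus.

```python
# pip install penman
import penman

# "The boy wants to go." Note the reentrancy: b fills :ARG0 of both predicates.
g = penman.decode("""
(w / want-01
   :ARG0 (b / boy)
   :ARG1 (g2 / go-01
             :ARG0 b))
""")
print(g.top)      # 'w'
print(g.triples)  # [('w', ':instance', 'want-01'), ('w', ':ARG0', 'b'), ...]
```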
{"title":"Abstract meaning representation of Turkish","authors":"Elif Oral, Ali Acar, Gülşen Eryiğit","doi":"10.1017/s1351324922000183","DOIUrl":"https://doi.org/10.1017/s1351324922000183","url":null,"abstract":"\u0000 Abstract meaning representation (AMR) is a graph-based sentence-level meaning representation that has become highly popular in recent years. AMR is a knowledge-based meaning representation heavily relying on frame semantics for linking predicate frames and entity knowledge bases such as DBpedia for linking named entity concepts. Although it is originally designed for English, its adaptation to non-English languages is possible by defining language-specific divergences and representations. This article introduces the first AMR representation framework for Turkish, which poses diverse challenges for AMR due to its typological differences compared to English; agglutinative, free constituent order, morphologically highly rich resulting in fewer word surface forms in sentences. The introduced solutions to these peculiarities are expected to guide the studies for other similar languages and speed up the construction of a cross-lingual universal AMR framework. Besides this main contribution, the article also presents the construction of the first AMR corpus of 700 sentences, the first AMR parser (i.e., a tree-to-graph rule-based AMR parser) used for semi-automatic annotation, and the evaluation of the introduced resources for Turkish.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":" ","pages":""},"PeriodicalIF":2.5,"publicationDate":"2022-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46085536","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
A survey of methods for revealing and overcoming weaknesses of data-driven Natural Language Understanding
IF 2.5 | CAS Tier 3 (Computer Science) | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2022-04-22 | DOI: 10.1017/s1351324922000171
Viktor Schlegel, G. Nenadic, R. Batista-Navarro
Abstract: Recent years have seen a growing number of publications that analyse Natural Language Understanding (NLU) datasets for superficial cues, asking whether such cues undermine the complexity of the tasks underlying those datasets and how they affect models that are optimised and evaluated on this data. This structured survey provides an overview of this evolving research area by categorising reported weaknesses in models and datasets, along with the methods proposed to reveal and alleviate those weaknesses for the English language. We summarise and discuss the findings and conclude with a set of recommendations for possible future research directions. We hope that it will be a useful resource both for researchers who propose new datasets, to assess the suitability and quality of their data for evaluating various phenomena of interest, and for those who propose novel NLU approaches, to further understand the implications of their improvements with respect to their model's acquired capabilities.
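One of the best-known diagnostics in this literature is the hypothesis-only baseline for NLI: if a classifier predicts labels above chance without ever seeing the premise, the dataset leaks label information through superficial cues in the hypothesis alone. A minimal sketch follows (the toy data and the sklearn-based probe are illustrative, not the survey's own experiments).

```python
# pip install scikit-learn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the hypotheses of an NLI training set.
hypotheses = ["A man is sleeping", "Nobody is outside",
              "A woman is eating", "No animals are present"]
labels = ["entailment", "contradiction", "entailment", "contradiction"]

# The premise is deliberately withheld: above-chance accuracy here
# signals an annotation artifact rather than genuine inference.
probe = make_pipeline(CountVectorizer(), LogisticRegression())
probe.fit(hypotheses, labels)
print(probe.predict(["Nobody is eating"]))  # likely 'contradiction', via the cue word 'nobody'
```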
{"title":"A survey of methods for revealing and overcoming weaknesses of data-driven Natural Language Understanding","authors":"Viktor Schlegel, G. Nenadic, R. Batista-Navarro","doi":"10.1017/s1351324922000171","DOIUrl":"https://doi.org/10.1017/s1351324922000171","url":null,"abstract":"Abstract Recent years have seen a growing number of publications that analyse Natural Language Understanding (NLU) datasets for superficial cues, whether they undermine the complexity of the tasks underlying those datasets and how they impact those models that are optimised and evaluated on this data. This structured survey provides an overview of the evolving research area by categorising reported weaknesses in models and datasets and the methods proposed to reveal and alleviate those weaknesses for the English language. We summarise and discuss the findings and conclude with a set of recommendations for possible future research directions. We hope that it will be a useful resource for researchers who propose new datasets to assess the suitability and quality of their data to evaluate various phenomena of interest, as well as those who propose novel NLU approaches, to further understand the implications of their improvements with respect to their model’s acquired capabilities.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":"29 1","pages":"1 - 31"},"PeriodicalIF":2.5,"publicationDate":"2022-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44296997","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
NLE volume 28 issue 3 Cover and Front matter
IF 2.5 | CAS Tier 3 (Computer Science) | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2022-04-08 | DOI: 10.1017/s1351324922000158
R. Mitkov, B. Boguraev
translation, computer science or engineering. Its aim is to bridge the gap between computational linguistics research and the implementation of practical applications with potential real-world use. As well as publishing original research articles on a broad range of topics - from text analysis, machine translation, information retrieval, speech processing and generation to integrated systems and multi-modal interfaces - it also publishes special issues on specific natural language processing methods, tasks or applications. The journal welcomes survey papers describing the state of the art of a specific topic. The Journal of Natural Language Engineering also publishes the popular Industry Watch and Emerging Trends columns as well as book reviews.
{"title":"NLE volume 28 issue 3 Cover and Front matter","authors":"R. Mitkov, B. Boguraev","doi":"10.1017/s1351324922000158","DOIUrl":"https://doi.org/10.1017/s1351324922000158","url":null,"abstract":"trans-lation, computer science or engineering. Its is to the computational linguistics research and the implementation of practical applications with potential real-world use. As well as publishing original research articles on a broad range of topics - from text analy- sis, machine translation, information retrieval, speech processing and generation to integrated systems and multi-modal interfaces - it also publishes special issues on specific natural language processing methods, tasks or applications. The journal welcomes survey papers describing the state of the art of a specific topic. The Journal of Natural Language Engineering also publishes the popular Industry Watch and Emerging Trends columns as well as book reviews.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":" ","pages":"f1 - f2"},"PeriodicalIF":2.5,"publicationDate":"2022-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45341304","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
NLE volume 28 issue 3 Cover and Back matter
IF 2.5 | CAS Tier 3 (Computer Science) | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2022-04-08 | DOI: 10.1017/s135132492200016x
{"title":"NLE volume 28 issue 3 Cover and Back matter","authors":"","doi":"10.1017/s135132492200016x","DOIUrl":"https://doi.org/10.1017/s135132492200016x","url":null,"abstract":"","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":" ","pages":"b1 - b2"},"PeriodicalIF":2.5,"publicationDate":"2022-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46428676","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
The voice synthesis business: 2022 update
IF 2.5 | CAS Tier 3 (Computer Science) | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2022-04-08 | DOI: 10.1017/S1351324922000146
R. Dale
Abstract: In the past few years, high-quality automated text-to-speech synthesis has effectively become a commodity, with easy access to cloud-based APIs provided by a number of major players. At the same time, developments in deep learning have broadened the scope of voice synthesis functionalities that can be delivered, leading to a growth in the range of commercially viable use cases. We take a look at the technology features and use cases that have attracted attention and investment in the past few years, identifying the major players and recent start-ups in the space.
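To illustrate how commoditized the cloud APIs mentioned here have become, below is the quickstart pattern for one provider (Google Cloud Text-to-Speech), written from memory and requiring GCP credentials; the column itself does not include code.

```python
# pip install google-cloud-texttospeech   (requires Google Cloud credentials)
from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()
response = client.synthesize_speech(
    input=texttospeech.SynthesisInput(text="High-quality speech synthesis is now a commodity."),
    voice=texttospeech.VoiceSelectionParams(language_code="en-US"),
    audio_config=texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3),
)
with open("output.mp3", "wb") as f:
    f.write(response.audio_content)  # one short call per utterance
```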
{"title":"The voice synthesis business: 2022 update","authors":"R. Dale","doi":"10.1017/S1351324922000146","DOIUrl":"https://doi.org/10.1017/S1351324922000146","url":null,"abstract":"Abstract In the past few years, high-quality automated text-to-speech synthesis has effectively become a commodity, with easy access to cloud-based APIs provided by a number of major players. At the same time, developments in deep learning have broadened the scope of voice synthesis functionalities that can be delivered, leading to a growth in the range of commercially viable use cases. We take a look at the technology features and use cases that have attracted attention and investment in the past few years, identifying the major players and recent start-ups in the space.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":"28 1","pages":"401 - 408"},"PeriodicalIF":2.5,"publicationDate":"2022-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44725390","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6