Natural Language Processing for Corpus Linguistics by Jonathan Dunn. Cambridge: Cambridge University Press, 2022. ISBN 9781009070447 (PB), ISBN 9781009070447 (OC), vi+88 pages.
{"title":"Natural Language Processing for Corpus Linguistics by Jonathan Dunn. Cambridge: Cambridge University Press, 2022. ISBN 9781009070447 (PB), ISBN 9781009070447 (OC), vi+88 pages.","authors":"J. Wen, Lan Yi","doi":"10.1017/S1351324922000201","DOIUrl":null,"url":null,"abstract":"Corpus linguistics is essentially the computer-based empirical analysis that examines naturally occurring language and its use with a representative collection of machine-readable texts (Sinclair, 1991; Biber, Conrad and Reppen, 1998; McEnery and Hardie, 2012). The techniques of corpus linguistics enable the analyzing of large amounts of corpus data from both qualitative (e.g., concordances) and quantitative (e.g., word frequencies) perspectives, which in turn may yield evidence for or against the proposed linguistic statements or assumptions (Reppen, 2010). Despite its success in a wide range of fields (Römer, 2022), traditional corpus linguistics has become seemingly disconnected from recent technological advances in artificial intelligence as the computing power and corpus data available for linguistic analysis continue to grow in the past decades. In this connection, more sophisticated methods are needed to update and expand the arsenal for corpus linguistics research. As its name suggests, this monograph focuses exclusively on utilizing NLP techniques to uncover different aspects of language use through the lens of corpus linguistics. It consists of four main chapters plus a brief conclusion. Each of the four main chapters highlights a different aspect of computational methodologies for corpus linguistic research, followed by a discussion on the potential ethical issues that are pertinent to the application. Five corpus-based case studies are presented to demonstrate how and why a particular computational method is used for linguistic analysis. Given the methodological orientation of the book, it is not surprising that there are substantial technical details concerning the implementation of these methods, which is usually a daunting task for those readers without any background knowledge in computer programming. Fortunately, the author has made all the Python scripts and corpus data used in the case studies publicly available online at https://doi.org/10.24433/CO.3402613.v1. These online supporting materials are an invaluable complement to the book because they not only ease readers from coding but also make every result and graph in the book readily reproducible. To provide better hands-on experience for readers, a quick walkthrough on the accessing of online materials is presented prior to the beginning of the main chapters. With just a few clicks, readers will be able to run the code and replicate the case studies with interactive code notebooks. Of course, readers who are familiar with Python programming are encouraged to further explore the corpus data and expand the scripts to serve their own research purposes. Chapter 1 provides a general overview of the computational analysis in corpus linguistics research and outlines the key issues to be addressed. It first defines the major problems (namely, categorization and comparison) in corpus analysis that NLP models can solve, and explains why computational linguistic analysis is needed for corpus linguistic research (namely., reproducibility and scalability). The author then introduces all five case studies to be presented in the forthcoming chapters. These studies, ranging from usage-based grammar to corpus-based sociolinguistics, demonstrate how NLP methods can be applied to investigate real-world linguistic phenomena. As for the key issues, the categorization problems and comparison problems are discussed in two","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":"29 1","pages":"842 - 845"},"PeriodicalIF":2.3000,"publicationDate":"2022-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Natural Language Engineering","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1017/S1351324922000201","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 1
Abstract
Corpus linguistics is essentially the computer-based empirical analysis that examines naturally occurring language and its use with a representative collection of machine-readable texts (Sinclair, 1991; Biber, Conrad and Reppen, 1998; McEnery and Hardie, 2012). The techniques of corpus linguistics enable the analyzing of large amounts of corpus data from both qualitative (e.g., concordances) and quantitative (e.g., word frequencies) perspectives, which in turn may yield evidence for or against the proposed linguistic statements or assumptions (Reppen, 2010). Despite its success in a wide range of fields (Römer, 2022), traditional corpus linguistics has become seemingly disconnected from recent technological advances in artificial intelligence as the computing power and corpus data available for linguistic analysis continue to grow in the past decades. In this connection, more sophisticated methods are needed to update and expand the arsenal for corpus linguistics research. As its name suggests, this monograph focuses exclusively on utilizing NLP techniques to uncover different aspects of language use through the lens of corpus linguistics. It consists of four main chapters plus a brief conclusion. Each of the four main chapters highlights a different aspect of computational methodologies for corpus linguistic research, followed by a discussion on the potential ethical issues that are pertinent to the application. Five corpus-based case studies are presented to demonstrate how and why a particular computational method is used for linguistic analysis. Given the methodological orientation of the book, it is not surprising that there are substantial technical details concerning the implementation of these methods, which is usually a daunting task for those readers without any background knowledge in computer programming. Fortunately, the author has made all the Python scripts and corpus data used in the case studies publicly available online at https://doi.org/10.24433/CO.3402613.v1. These online supporting materials are an invaluable complement to the book because they not only ease readers from coding but also make every result and graph in the book readily reproducible. To provide better hands-on experience for readers, a quick walkthrough on the accessing of online materials is presented prior to the beginning of the main chapters. With just a few clicks, readers will be able to run the code and replicate the case studies with interactive code notebooks. Of course, readers who are familiar with Python programming are encouraged to further explore the corpus data and expand the scripts to serve their own research purposes. Chapter 1 provides a general overview of the computational analysis in corpus linguistics research and outlines the key issues to be addressed. It first defines the major problems (namely, categorization and comparison) in corpus analysis that NLP models can solve, and explains why computational linguistic analysis is needed for corpus linguistic research (namely., reproducibility and scalability). The author then introduces all five case studies to be presented in the forthcoming chapters. These studies, ranging from usage-based grammar to corpus-based sociolinguistics, demonstrate how NLP methods can be applied to investigate real-world linguistic phenomena. As for the key issues, the categorization problems and comparison problems are discussed in two
期刊介绍:
Natural Language Engineering meets the needs of professionals and researchers working in all areas of computerised language processing, whether from the perspective of theoretical or descriptive linguistics, lexicology, computer science or engineering. Its aim is to bridge the gap between traditional computational linguistics research and the implementation of practical applications with potential real-world use. As well as publishing research articles on a broad range of topics - from text analysis, machine translation, information retrieval and speech analysis and generation to integrated systems and multi modal interfaces - it also publishes special issues on specific areas and technologies within these topics, an industry watch column and book reviews.