Modest XPath and XQuery for corpora: Exploiting deep XML annotation

Christoph Rühlemann, Andrej Bagoutdinov, M. O'Donnell
ICAME Journal: Computers in English Linguistics. Published 2015-03-01. DOI: 10.1515/icame-2015-0003 (https://doi.org/10.1515/icame-2015-0003)
Citations: 13

Abstract

This paper outlines a modest approach to XPath and XQuery, tools allowing the navigation and exploitation of XML-encoded texts. The paper starts off from where Andrew Hardie’s paper “Modest XML for corpora: Not a standard, but a suggestion” (Hardie 2014) left the reader, namely wondering how one’s corpus can be usefully analyzed once its XML encoding is finished, a question that paper did not address. Hardie argued persuasively that “there is a clear benefit to be had from a set of recommendations (not a standard) that outlines general best practices in the use of XML in corpora without going into any of the more technical aspects of XML or the full weight of TEI encoding” (Hardie 2014: 73). In a similar vein, this paper argues that even a basic understanding of XPath and XQuery can bring great benefits to corpus linguists. To make this point, we not only present a modest introduction to the basic structures underlying XPath and XQuery syntax but also demonstrate their analytical potential using Obama’s 2009 Inaugural Address as a test bed. The speech was encoded in XML, automatically PoS-tagged and manually annotated on additional layers that target two rhetorical figures, anaphora and isocola. We refer to this resource as the Inaugural Rhetorical Corpus (IRC). Further, we created a companion website hosting not only the Inaugural Rhetorical Corpus but also the Inaugural Training Corpus (an abbreviated version of the IRC that allows manual checks of query results), as well as an extensive list of tried and tested queries for use with either corpus. All of the queries presented in this paper are at beginner to lower-intermediate levels of XPath/XQuery expertise. Nonetheless, they yield fruitful results: they show how Obama uses the inclusive pronouns “we” and “our” as a discursive strategy to advance his political strategy of re-focusing American politics on economic and domestic matters. Further, they demonstrate how sentence length contributes to the build-up of climactic tension. Finally, they suggest that Obama’s signature rhetorical figure is the isocolon and that the overwhelming majority of isocola in the speech instantiate the crescens type, where the cola gradually increase in length over the sequence.
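To give a flavour of the kind of beginner-level query the abstract refers to, the following is a minimal sketch in Python, whose standard-library `xml.etree.ElementTree` module supports a subset of XPath. The element and attribute names (`speech`, `s`, `w`, `pos`, `n`) and the two toy sentences are assumptions made for illustration only; they are not the IRC's actual markup or text, and the queries are not the authors' own.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy markup: <s> = sentence, <w> = word token, @pos = PoS tag.
# These names and sentences are illustrative assumptions, not the IRC schema.
SAMPLE = """
<speech>
  <s n="1"><w pos="PRP">We</w><w pos="VBP">build</w><w pos="NNS">roads</w></s>
  <s n="2"><w pos="PRP$">Our</w><w pos="NN">work</w><w pos="VBZ">continues</w><w pos="NN">today</w></s>
</speech>
"""

root = ET.fromstring(SAMPLE)

# XPath: every <w> element anywhere below the root.
words = root.findall(".//w")

# Case-insensitive match is done in Python, because ElementTree's
# XPath subset does not provide string functions such as lower-case().
inclusive = [w.text for w in words if w.text.lower() in {"we", "our"}]

# XPath with an attribute predicate: <w> elements tagged exactly "NN".
nouns = [w.text for w in root.findall(".//w[@pos='NN']")]

# Sentence length in tokens, keyed by each sentence's n attribute.
lengths = {s.get("n"): len(s.findall("w")) for s in root.findall(".//s")}

print(inclusive)
print(nouns)
print(lengths)
```

On a real corpus, the same three patterns (select all tokens, filter by an attribute, count children per sentence) would correspond to the pronoun and sentence-length analyses the abstract mentions. A full XPath or XQuery processor would allow the filtering to be expressed inside the query itself.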