Compiling a corpus of South Asian online Englishes: A report, some reflections and a pilot study

Muhammad Shakir, Dagmar Deuber

ICAME Journal: Computers in English Linguistics. Published 2023-05-01. DOI: 10.2478/icame-2023-0007 (https://doi.org/10.2478/icame-2023-0007)

Abstract

This article introduces the South Asian Online Englishes (SAOnE) corpus, which represents four South Asian countries (Bangladesh, India, Pakistan, and Sri Lanka) and two native English-speaking countries (the UK and the USA). We used semi-automatic and manual methods to collect data from three internet registers (newspaper comments, web forums, and tweets) and from a collection of internet sub-registers that we label blogs and websites. Additionally, we collected text messages from each of the four South Asian countries via online freelance hiring platforms. Each register in the corpus comprises approximately 1 million words per country, except text messages, which comprise around 500,000 words per country and cover only the four South Asian countries. We verified the origin of website and blog links and of Twitter authors, and where possible of commenters and web forum users, to ensure that only local content from each country is included. The corpus features some indigenous language content, which is tagged. In addition to describing this dataset, we present a pilot study analysing three discourse particles: na, neh, and yaar. The particles na and yaar are native to Hindi/Urdu, while neh is based on a Sinhala negation marker. Our analysis indicates that na and neh are similar in their position in the clause/utterance. However, neh is confined to Sri Lanka, while the Hindi/Urdu-based discourse particles are also used in our Twitter data from Sri Lanka and Bangladesh. The use of these particles in Bangladeshi tweets shows the influence of Indian culture through Bollywood celebrities. Of the Hindi/Urdu particles, yaar is preferred in Pakistan and na in India; additionally, yaar occurs at the start of the clause more often in our Pakistani data. Lastly, we discuss the implications of the pilot study, the advantages of the type of data used for it, and future research directions.