Data preparation in crowdsourcing for pedagogical purposes

Tanara Zingano Kuhn, Špela Arhar Holdt, Iztok Kosem, Carole Tiberius, Kristina Koppel, R. Zviel-Girshin
Slovenščina 2.0: empirical, applied and interdisciplinary research. Published 2022-12-29. DOI: 10.4312/slo2.0.2022.2.62-100 (https://doi.org/10.4312/slo2.0.2022.2.62-100)

Abstract

One way to stimulate the use of corpora in language education is by making pedagogically appropriate corpora, labeled with different types of problems (sensitive content, offensive language, structural problems). However, manually labeling corpora is extremely time-consuming and a better approach should be found. We thus propose a combination of two approaches to the creation of problem-labeled pedagogical corpora of Dutch, Estonian, Slovene and Brazilian Portuguese: the use of games with a purpose and of crowdsourcing for the task. We conducted initial experiments to establish the suitability of the crowdsourcing task, and used the lessons learned to design the Crowdsourcing for Language Learning (CrowLL) game in which players identify problematic sentences, classify them, and indicate problematic excerpts. The focus of this paper is on data preparation, given the crucial role that such a stage plays in any crowdsourcing project dealing with the creation of language learning resources. We present the methodology for data preparation, offering a detailed presentation of source corpora selection, pedagogically oriented GDEX configurations, and the creation of lemma lists, with a special focus on common and language-dependent decisions. Finally, we offer a discussion of the challenges that emerged and the solutions that have been implemented so far.