Topic Modeling for Text Structure Assessment: The Case of Russian Academic Texts

Journal of Language and Education (IF 1.0, JCR Q3, Education & Educational Research) · Pub Date: 2023-09-30 · DOI: 10.17323/jle.2023.16604
Valery Solovyev, Marina Solnyshkina, Elena Tutubalina
Citations: 0

Abstract

Background: Automatic assessment of text complexity is an important task, primarily in education. Existing methods of computing text complexity rely on simple surface properties of a text and neglect the complexity of its content and structure. The current paradigm of complexity studies can no longer keep up with the challenges of automatically evaluating text structure.
Purpose: The aim of the paper is twofold. (1) It introduces a new notion, the complexity of a text's topical structure, which we define as a quantifiable measure combining four parameters: number of topics, topic coherence, topic distribution, and topic weight. We hypothesize that these parameters depend on text complexity and are aligned with grade level. (2) It aims to justify the applicability of recently developed topic modeling methods to measuring the complexity of a text's topical structure.
Method: To test this hypothesis, we use the Russian Academic Corpus, which comprises school textbooks, texts for Russian as a foreign language, and fiction recommended for reading in different grades. We employ it in three versions: (i) Full Texts Corpus, (ii) Corpus of Segments, and (iii) Corpus of Paragraphs. The software tools we use include LDA (Latent Dirichlet Allocation), OnlineLDA, and Additive Regularization of Topic Models, together with a Word2vec-based metric and Normalized Pointwise Mutual Information (NPMI).
Results: Our findings are as follows: the optimal number of topics in educational texts is around 20; topic coherence and topic distribution are found to be functions of grade-level complexity; and text complexity can be estimated from structural organization parameters, an approach we view as a new algorithm complementing the classical assessment of text complexity based on linguistic features.
Conclusion: The results reported and discussed in the article strongly suggest that the theoretical framework and the analytic algorithms used in the study may be fruitfully applied in education and provide a basis for assessing the complexity of academic texts.
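
To make the topic coherence parameter concrete, the following is a minimal sketch of an NPMI-based coherence score for a single topic. The abstract does not describe the authors' implementation, so everything here is an assumption for illustration: the function name `npmi_coherence`, the toy corpus `toy_docs`, the use of document-level co-occurrence to estimate probabilities (many implementations use sliding windows instead), and the convention of scoring a pair that appears in every document as 0.

```python
import math
from itertools import combinations


def npmi_coherence(top_words, documents):
    """Average NPMI over all pairs of a topic's top words.

    Hypothetical helper, not the paper's code. Probabilities are
    estimated as the share of documents containing a word (or both
    words of a pair).
    """
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)
    if n_docs == 0 or len(top_words) < 2:
        raise ValueError("need at least one document and two top words")

    def prob(*words):
        # Share of documents that contain every word in `words`.
        hits = sum(1 for d in doc_sets if all(w in d for w in words))
        return hits / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0.0:
            # The words never co-occur: NPMI takes its lower limit.
            scores.append(-1.0)
        elif p12 == 1.0:
            # Both words occur in every document; NPMI is 0/0 here.
            # Scoring it as 0 is a convention of this sketch.
            scores.append(0.0)
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)


# Toy corpus of tokenized documents (illustrative data only).
toy_docs = [
    ["cell", "nucleus", "membrane"],
    ["cell", "nucleus", "protein"],
    ["river", "mountain", "lake"],
    ["river", "lake", "climate"],
]

print(npmi_coherence(["cell", "nucleus"], toy_docs))
print(npmi_coherence(["cell", "river"], toy_docs))
```

On this toy corpus, "cell" and "nucleus" always appear together and score 1 (up to floating-point rounding), while "cell" and "river" never co-occur and score -1. A model-level coherence value would typically average such scores over all topics, which is one way the number of topics could be compared across candidate models.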
Source journal
Journal of Language and Education (Arts and Humanities: Language and Linguistics)
CiteScore: 1.70
Self-citation rate: 14.30%
Articles published: 33
Review time: 18 weeks