Lexical and syntactic features of academic Russian texts: a discriminant analysis

RESEARCH RESULT Theoretical and Applied Linguistics Pub Date : 2022-12-30 DOI:10.18413/2313-8912-2022-8-4-0-8

R. Kupriyanov, M. Solnyshkina, M. Dascalu, Tatyana A. Soldatkina


Citations: 4

Abstract

This article presents three mathematical models that differentiate Russian-language academic texts from three subject discourses (Philological, Mathematical, and Natural Sciences) and thereby support the design and automated profiling of the corresponding text typologies. The models use five indices: one surface-level feature (sentence length) and four syntactic features (mean verbs per sentence, mean adjectives per sentence, local noun overlap, and global argument overlap). We identified and validated these five statistically significant features from among 45 linguistic features extracted from our research corpus of 91,185 tokens. The shortest sentences are found in Russian language textbooks, while the longest are found in Natural Science texts. The mean number of verbs, nouns, and adjectives per sentence is higher in Natural Science textbooks, whereas Mathematics discourse is characterized by the shortest word length, the highest local noun overlap, and the highest global argument overlap. We attribute the metric differences between the three discourses to their functions: Natural Science texts are characterized by descriptions and narrative passages, in contrast to Philology, which is associated with opinions. Mathematical discourse operates with precise definitions, explanations, and justifications, and therefore exhibits numerous overlaps. The discriminant analysis built on these features, together with the automated feature extraction and the classification formulas provided, enables the design and development of text profilers for parametric analysis and for textbook writing and editing. Our findings can help professional linguists, technologists, and academic writers select and modify texts for their target audience.
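As a rough illustration of the kind of indices the abstract mentions, the sketch below computes mean sentence length and a simplified local overlap score for a short Russian passage. It is not the authors' pipeline: the function names, the naive sentence splitter, and the sample text are all assumptions made for illustration, and the overlap here counts any shared word form between adjacent sentences, since identifying nouns would require a part-of-speech tagger that this sketch does not include.

```python
import re


def split_sentences(text):
    """Naive splitter (an assumption): break after ., !, ? or … plus whitespace."""
    parts = re.split(r"(?<=[.!?…])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Lowercased word tokens; \\w matches Cyrillic letters in Python 3."""
    return re.findall(r"\w+", sentence.lower())


def mean_sentence_length(text):
    """Average number of word tokens per non-empty sentence."""
    sentences = [tokenize(s) for s in split_sentences(text)]
    sentences = [s for s in sentences if s]
    if not sentences:
        return 0.0
    return sum(len(s) for s in sentences) / len(sentences)


def local_word_overlap(text):
    """Share of adjacent sentence pairs with at least one word form in common.

    A simplified stand-in for local noun overlap: no POS tagging or
    lemmatization is done, so inflected forms of one noun do not match.
    """
    sentences = [set(tokenize(s)) for s in split_sentences(text)]
    pairs = list(zip(sentences, sentences[1:]))
    if not pairs:
        return 0.0
    return sum(1 for a, b in pairs if a & b) / len(pairs)


if __name__ == "__main__":
    # Hypothetical sample passage, not taken from the research corpus.
    sample = "Угол равен 90 градусам. Такой угол называется прямым. Докажем теорему."
    print(mean_sentence_length(sample))
    print(local_word_overlap(sample))
```

A real profiler would replace the tokenizer with a morphological analyzer so that overlap is measured on noun lemmas, and would feed the resulting indices into the discriminant functions reported in the paper.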
