I. Shevchenko, Pavlo Andreiev, N. Khairova, Maiia Dernova
Transactions of Kremenchuk Mykhailo Ostrohradskyi National University. Published 2021-10-27. DOI: 10.30929/1995-0519.2021.5.62-67 (https://doi.org/10.30929/1995-0519.2021.5.62-67)
USE OF THE STATISTICAL MODEL OF COHERENCE OF CONNECTED TEXT AS AN ADDITIONAL TOOL OF QUANTITATIVE CONTENT ANALYSIS
Purpose. We treat the language system as a set of subsystems organized as a semiotic hierarchy, in which the content of higher-level units cannot be fully reduced to the content of their lower-level components. The meaning of a higher-level unit therefore cannot always be "calculated" from the meanings of lower-level units and the relationships between them. At the same time, the structural model of the language system relies on thematic or semantic features of connectivity between units at the same level of the hierarchy, which opens up possibilities for quantitative content analysis.
Methodology. In reviewing known works, we found none that analyzes paragraphs as independent structural units of the text. A paragraph usually develops one micro-theme that contributes to the theme of the whole text. We hypothesize that if a text is coherent, that is, if a certain topic serves as its leitmotif, the frequencies of certain words should change from one paragraph to the next according to certain patterns. The aim of this work is to study whether the coherence of the frequency characteristics of paragraphs can be used to identify keywords and the satellite words that surround them (context sets).
Results. To achieve this aim, the following tasks were solved: development of a text model suited to paragraph-by-paragraph analysis of the dynamics of relative word frequencies; development of a method of paragraph-by-paragraph text analysis; and testing of the method on a collection of documents.
Originality. A text representation model has been developed that differs from existing ones in that it includes a set of the most frequent words, a set of keywords, a set of satellite words, and the intersection of the sets of paragraphs, keywords, and satellite words. This model provides a formal basis for a method that analyzes the dynamics of the relative frequencies of the most frequent words in a text and identifies keywords and context sets. A method of text analysis has been developed that differs from existing ones in that it is based on detecting positive correlations between the relative frequencies with which a subset of the most frequent words occurs in paragraphs. The method identifies keywords and context subsets in texts that have some coherence and in individual paragraphs of text that have weak coherence.
Practical value. A set of Ukrainian-, Russian-, and English-language scientific and technical texts was assembled to test the effectiveness of the text analysis method. The set includes scientific and technical articles on various topics and fragments of textbooks. For the articles, the keywords detected by machine analysis were compared with the authors' own keyword sets; for the textbook fragments, experts determined the keyword sets. Comparing the authors' and experts' keyword sets with the sets produced by the proposed method showed that the method is effective. The match ranged from 50% to 90%, given that the authors' sets contained phrases, whereas the machine sets listed the elements of those phrases separately. The method can be used as an auxiliary tool for content analysis of connected texts.
References: 15.
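The general idea of correlating per-paragraph relative frequencies can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say how paragraphs are delimited, how many frequent words are kept, whether stop words are filtered, which correlation coefficient is used, or what threshold applies. The blank-line paragraph split, the `top_n` and `threshold` parameters, the use of Pearson correlation, and the function names below are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def paragraph_frequencies(text, top_n=5):
    """Relative frequency, per paragraph, of the top_n most frequent words.

    Assumption: paragraphs are separated by blank lines, and no stop-word
    filtering or lemmatization is applied.
    """
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    # \w+ matches Unicode letters, so Cyrillic text is tokenized as well.
    tokenized = [re.findall(r"\w+", p.lower()) for p in paragraphs]
    tokenized = [tokens for tokens in tokenized if tokens]
    totals = Counter(word for tokens in tokenized for word in tokens)
    frequent = [word for word, _ in totals.most_common(top_n)]
    return {
        word: [tokens.count(word) / len(tokens) for tokens in tokenized]
        for word in frequent
    }


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if one is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def context_sets(series, threshold=0.5):
    """Map each frequent word to the words whose frequency series
    correlate with it above the (assumed) positive threshold."""
    words = list(series)
    sets = {word: set() for word in words}
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            if pearson(series[a], series[b]) > threshold:
                sets[a].add(b)
                sets[b].add(a)
    return sets
```

Calling `context_sets(paragraph_frequencies(text))` on a multi-paragraph string returns, for each frequent word, the other frequent words whose relative frequencies rise and fall with it across paragraphs. In this sketch, words with large sets are candidate keywords and the members of those sets are candidate satellite words; how the published method actually separates the two is not stated in the abstract.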