Regilaulude teema-analüüs: võimalusi ja väljakutseid / Topic analysis of Estonian runosongs: Prospects and challenges

Methis (Q1, Arts and Humanities). Pub Date: 2020-12-15. DOI: 10.7592/methis.v21i26.16914
M. Sarv
Citations: 0

Abstract

The article examines the possibilities of topic analysis of runosongs using the method of topic modelling. A problem in applying the method is the regional variation of runosong language. An initial analysis of the song texts showed that topic analysis gives more meaningful results for collections with more uniform language. The topic analysis of songs from Hiiumaa, Saaremaa and Muhu, selected for closer examination, identified 20 topics, which give a quick overview of the thematic structure of the songs under consideration. The study showed that the identified topics are distributed fairly evenly across the region examined. However, the computational topic groups do not coincide one-to-one with the earlier classification of runosongs: they disregard the genre differences of the songs and foreground the song types that occur more frequently in the collection under consideration.

The article explores the possibilities of computational topic analysis of Estonian runosong texts using latent Dirichlet allocation (LDA) topic modelling. Runosong is an oral poetic tradition known among most of the Finnic peoples. Estonian runosong texts, the material of the current research, have been collected mainly since the 1880s and gathered into the Estonian Folklore Archives of the Estonian Literary Museum, where a runosong database with more than 100 000 texts has been compiled (Oras et al. 2003–2020). The language of runosongs varies considerably across dialects and, in addition, uses a specific archaic idiom different from the spoken language, which complicates computational analysis of the content of the texts.

Topic modelling is a method for discovering abstract topics, detected statistically on the basis of how frequently words co-occur in the texts.
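To make the idea of co-occurrence-based topics concrete, the following is a minimal sketch of LDA fitted by collapsed Gibbs sampling in plain Python. It is not the MALLET implementation used in the article; the function names, the hyperparameter values (`alpha`, `beta`) and the toy English tokens are all assumptions chosen for illustration, and real runosong texts would replace `toy_docs`.

```python
import random
from collections import Counter


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA to tokenised documents with collapsed Gibbs sampling.

    Returns (doc_topic, topic_word): per-document topic counts and
    per-topic word counts after the final sweep.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [Counter() for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assign = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample: topics already common in this document and
                # already associated with this word get more weight.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


def top_words(topic_word, n=5):
    """List the n most frequent words with a non-zero count per topic."""
    return [[w for w, c in tw.most_common(n) if c > 0] for tw in topic_word]


# Toy data (invented English tokens, not real runosong material).
toy_docs = [
    ["bride", "groom", "wedding", "bride", "gift"],
    ["horse", "forest", "ride", "horse"],
    ["wedding", "gift", "groom", "bride"],
    ["forest", "ride", "horse", "forest"],
]
doc_topic, topic_word = lda_gibbs(toy_docs, n_topics=2)
keywords = top_words(topic_word)
print(keywords)
```

The keyword lists printed at the end play the same role as the topic keywords discussed in the abstract: each topic is read through its most frequent words.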
For a runosong corpus, the method could be used to automatically detect the thematic structure of a large number of runosong texts, to compare the thematic distribution of regional runosong traditions, and to analyse how the thematic distribution obtained by computational methods relates to the classification of the texts resulting from folkloristic analysis. The aim of the current article is to explore whether topic modelling can give meaningful results when applied to unlemmatized and highly variable runosong texts.

For LDA topic modelling I used the application MALLET (McCallum 2002). The initial trials with the whole corpus of runosong texts made it clear that the language of the songs varies too much for the model to reach the level of content. It also became obvious that stopwords and refrain words need to be removed. The topics obtained from runosongs from all over Estonia represented dialectal variants of the language rather than thematic clusters, so it was necessary to restrict the material. I used stylometric analysis (with the R package stylo, Eder et al. 2013) to divide the area into linguistically more homogeneous subregions, and chose the western islands of Estonia, with 16 parishes and 3672 song texts, for further exploration.

With this material I decided to generate 20 topics. Within this smaller area the topics no longer clustered regional language variants: (1) the linguistic variants of the main concepts of a topic were brought together under the keywords of the same topic; (2) in most cases, the detected topics were distributed among all the parishes included in the selection.

Looking at the 20 keywords, the topics did indeed seem to reflect certain thematic subgroups of the songs. In several cases the most prominent song type of a topic was reflected in the keywords; in other cases the keywords referred to larger groups of songs.
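Two of the steps described above, removing stopwords and refrain words before modelling and checking whether a topic occurs in all parishes, can be sketched as follows. This is an illustrative sketch only: the word lists, the made-up sample line, the toy topic shares and the 0.2 threshold are assumptions, not the lists or settings used in the article.

```python
import re
from collections import defaultdict

# Illustrative lists only; a real study would need far fuller ones.
STOPWORDS = {"ja", "ei", "on"}
REFRAIN_WORDS = {"kaske", "leelo"}


def preprocess(text, drop=STOPWORDS | REFRAIN_WORDS):
    """Lowercase, split into runs of letters, drop stopwords and refrains."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if t not in drop]


def parish_coverage(doc_parish, doc_topic_share, threshold=0.2):
    """Map each topic index to the parishes where at least one song
    gives that topic a share of at least `threshold`."""
    coverage = defaultdict(set)
    for parish, shares in zip(doc_parish, doc_topic_share):
        for k, share in enumerate(shares):
            if share >= threshold:
                coverage[k].add(parish)
    return dict(coverage)


# A made-up line, not a quotation from the runosong database.
tokens = preprocess("Ja neiu läks, kaske, metsa!")

# Toy per-song topic shares for three songs and two topics.
song_parish = ["Muhu", "Kihelkonna", "Emmaste"]
song_topic_share = [[0.7, 0.3], [0.9, 0.1], [0.25, 0.75]]
coverage = parish_coverage(song_parish, song_topic_share)
print(tokens)
print(coverage)
```

A topic whose coverage set contains every parish in the selection is thematic rather than dialectal in the sense used above; a topic confined to one or two parishes would suggest that regional language variants are still driving the clustering.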
Five of the 20 topics focused on weddings, more precisely on different episodes of the wedding ritual: adornment and dressing, arriving and greeting, finding the bride and taking her to her new home, sharing the presents prepared by the bride, and recommendations to the bride and the groom. In all these topics the verbs refer either to the present or the future (rather than to the past, which is common in narrative songs). The topic of swinging songs also includes songs about dancing and feasts. Five topics focus on different narrative plots about the troubles of young people, and about wooing and marriage. Lyric songs about the life of orphans and about singing each form a separate topic, and there is a separate male topic covering songs of various genres related to horses, riding and the woods. The largest topic includes songs about working at home and outside, but also songs about premarital sex. Two topics focus on well-known children's songs and lullabies. Two topics relate to German landlords, their power and activities, and one to recruitment and war.
The conclusions of this exploration are: (1) topic modelling requires texts in homogeneous language variants; otherwise, the linguistic differences override the topics at some point; (2) it is possible to use unlemmatized texts for topic modelling, but in this case grammatical features (tense, modality) interfere with topic analysis; (3) the proportions of variable and stable (recurrent) elements (song types, motifs) in the material have a clear impact on topic formation: the more frequently an element occurs in the material, and the more stable its wording, the more likely it is to form the centre of a topic, whereas distinct but rare themes remain unnoticed and are shared between the topics of more prominent subjects; (4) the sets of words assembled into a topic may, in addition to a common thematic focus, refer to a common framework, for example environments, or behavioural or communicative patterns (for example, begging for something).

Compared to the folkloristic classification of folk songs, the automatic distribution of songs (1) highlights the subjects that occur more frequently in the body of songs (for example, a topic highlights swinging songs instead of the calendar songs of the folkloristic classification); (2) partly overrides genre differences (for example, song games can be found under different topics, whereas they form a distinct group in folkloristic classifications).