Latest articles from Slovenščina 2.0: empirical, applied and interdisciplinary research

Spletna orodja za slovenščino in tuji študenti Univerze v Ljubljani (Online tools for Slovene and foreign students of the University of Ljubljana)
Pub Date : 2021-12-29 DOI: 10.4312/slo2.0.2021.2.100-125
Mojca Stritar Kučuk
Full-time foreign students at the University of Ljubljana who learn Slovene in their first year of study as part of the Leto plus (Year Plus) module take a dedicated workshop in the second semester that introduces them in more detail to online language resources and technologies for Slovene. The paper describes how this workshop was run in the 2019/20 academic year, when, because of the coronavirus pandemic, it was held remotely in the form of interactive videos with comprehension-check tasks. The second part of the paper focuses on students' opinions of such language resources. Using an online survey, I analysed the attitudes and experiences of two cohorts of students: the 2018/19 cohort learned about the online tools in the classroom, and the 2019/20 cohort did so remotely. According to the survey results, the younger cohort uses online language resources more often. Students in both groups most frequently use Google Translate, followed by Sloleks, the inflection tool Besana, Fran and Pons. As reasons for using these resources, they mainly cite speed or ease of use and familiarity with a particular resource.
Citations: 0
Stalnost, variantnost in modificirana raba frazemov v slovenskem jeziku in slovarjih (Stability, variation and modified use of phrasemes in the Slovene language and dictionaries)
Pub Date : 2021-12-29 DOI: 10.4312/slo2.0.2021.2.71-99
Eva Trivunović
The paper surveys the variants and modifications of seven biblical or Bible-derived phrasemes in contemporary Slovene and their presence in the modern language. The findings are compared with the treatment of these phrasemes in existing dictionaries, revealing a large gap between the dictionary description and the state of affairs shown by corpus data. To determine more reliably when one can speak of already established variation, the study used three corpora of different genres: Gigafida 2.0, Janes and slWaC. Alongside established variants, non-established modifications are presented, with particular emphasis on renewals (prenovitve). However, the clearly defined typology proved too rigid in places: in some borderline cases, established variants could not be unambiguously distinguished from non-renewal modifications, nor non-renewal modifications from renewals. All the selected phrasemes and their renewals are most frequent in the Janes corpus, which demonstrates the need to include a larger number of diverse corpora in linguistic research.
Citations: 0
Sociolingvistični posvet: aktualni sociolingvistični izzivi in prednostne raziskovalne tematike (Sociolinguistic consultation: current sociolinguistic challenges and priority research topics)
Pub Date : 2021-12-29 DOI: 10.4312/slo2.0.2021.2.1-40
Maja Bitenc, Marko Stabej, Nataša Gliha Komac, Matejka Grgič, Monika Kalin Golob, K. Kenda-Jež, Albina Nečak Lük, Sonja Novak Lukanovič, Krištof Savski
A record of the consultation on current sociolinguistic challenges and priority research topics, organised by Asst. Prof. Dr. Maja Bitenc and Prof. Dr. Marko Stabej of the Department of Slovene Studies. It took place on Monday, 27 September 2021, at the Faculty of Arts, University of Ljubljana, and was streamed via Zoom. In the first part, the invited experts presented their views on the opening questions; in the second, a discussion among all participants followed. The speakers edited the transcript of the recording at their own discretion, in principle with as few interventions as possible, while the more substantive contributions from the discussion have been adapted for reading and published.
Citations: 0
Collocation ranking: frequency vs semantics
Pub Date : 2021-12-29 DOI: 10.4312/slo2.0.2021.2.41-70
Nikola Ljubesic, N. Logar, Iztok Kosem
Collocations play a very important role in language description, especially in identifying the meanings of words. In modern lexicography, lists of collocates ranked by a statistical measure are an indispensable part of deducing meaning. In this paper, we compare two approaches to ranking collocates: (a) the logDice method, which is the dominant, frequency-based approach, and (b) the fastText word-embeddings method, which is new and semantics-based. The comparison was made on two Slovene datasets, one representing general-language headwords and their collocates, and the other representing headwords and their collocates extracted from a language-for-special-purposes corpus. The evaluation had two parts: for the quantitative part, we used supervised machine learning with the support-vector machine (SVM) algorithm, scored by the area under the ROC curve (AUC); in the qualitative part, lexicographers evaluated the ranking results of the two methods. The results were somewhat inconsistent: while the quantitative evaluation confirmed that the machine-learning-based approach produced better collocate rankings than the frequency-based one, lexicographers in most cases considered the collocate lists produced by the two methods very similar.
Citations: 3
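To make the frequency-based side of the collocation-ranking comparison above concrete, here is a minimal Python sketch of logDice scoring. It assumes the commonly cited definition, logDice = 14 + log2(2·f_xy / (f_x + f_y)); the function names and the sample frequencies are hypothetical and are not taken from the paper.

```python
import math


def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """logDice association score: 14 + log2(2 * f_xy / (f_x + f_y)).

    f_xy is the co-occurrence frequency of headword and collocate,
    f_x and f_y are their individual corpus frequencies.
    """
    if f_xy <= 0:
        raise ValueError("co-occurrence frequency must be positive")
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


def rank_collocates(headword_freq, candidates):
    """Sort (collocate, collocate_freq, pair_freq) tuples by descending logDice."""
    scored = [
        (collocate, log_dice(pair_freq, headword_freq, collocate_freq))
        for collocate, collocate_freq, pair_freq in candidates
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical counts: a rare but strongly associated collocate ("a")
# against a very frequent, weakly associated one ("b").
ranking = rank_collocates(1000, [("a", 500, 100), ("b", 100000, 200)])
```

In this toy ranking, "a" comes first even though "b" co-occurs with the headword more often in absolute terms, because logDice normalises the pair frequency by the two individual frequencies.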
Mednarodni konferenci eLex (5.–7. julij 2021) in EURALEX (7.–9. september 2021) (The international conferences eLex, 5–7 July 2021, and EURALEX, 7–9 September 2021)
Pub Date : 2021-12-29 DOI: 10.4312/slo2.0.2021.2.126-129
Magdalena Gapsa
A report on two important lexicographic conferences: the seventh biennial conference of the Electronic lexicography in the 21st century association (eLex for short), held from 5 to 7 July 2021, and the nineteenth biennial conference of the European Association for Lexicography (EURALEX), held from 7 to 9 September 2021.
Citations: 0
Slovene and Croatian word embeddings in terms of gender occupational analogies
Pub Date : 2021-07-06 DOI: 10.4312/SLO2.0.2021.1.26-59
Matej Ulčar, Anka Supej, M. Robnik-Sikonja, Senja Pollak
In recent years, the use of deep neural networks and dense vector embeddings for text representation has led to excellent results in the field of computational understanding of natural language. It has also been shown that word embeddings often capture gender, racial and other types of bias. The article focuses on evaluating Slovene and Croatian word embeddings in terms of gender bias using word analogy calculations. We compiled a list of masculine and feminine nouns for occupations in Slovene and evaluated the gender bias of fastText, word2vec and ELMo embeddings with different configurations and different approaches to analogy calculations. The lowest occupational gender bias was observed with the fastText embeddings. Similarly, we compared different fastText embeddings on Croatian occupational analogies.
Citations: 1
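The word analogy calculations mentioned in the gender-bias abstract above can be illustrated with a small vector-offset sketch in Python. This is only one common way to compute analogies and is not necessarily any of the approaches the authors evaluated; the three-dimensional toy vectors and the helper names are invented for illustration and do not come from fastText, word2vec or ELMo.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(embeddings, a, b, c):
    """Answer 'a is to b as c is to ?' with the vector offset b - a + c.

    The three query words are excluded from the candidates, as is usual
    for this kind of analogy test.
    """
    target = [bv - av + cv
              for av, bv, cv in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = (w for w in embeddings if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(embeddings[w], target))


# Hypothetical toy vectors: dimension 0 ~ masculine, 1 ~ feminine, 2 ~ "teacher".
TOY = {
    "moški": [1.0, 0.0, 0.0],       # man
    "ženska": [0.0, 1.0, 0.0],      # woman
    "učitelj": [1.0, 0.0, 1.0],     # teacher (masculine)
    "učiteljica": [0.0, 1.0, 1.0],  # teacher (feminine)
    "miza": [0.1, 0.1, -1.0],       # table (distractor)
}

answer = analogy(TOY, "moški", "ženska", "učitelj")
```

With real embeddings, an occupational analogy counts as correct when the returned word is the expected feminine (or masculine) form of the occupation; systematic failures in one direction are one possible indicator of gender bias.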
Converting raw transcripts into an annotated and turn-aligned TEI-XML corpus: the example of the Corpus of Serbian Forms of Address
Pub Date : 2021-07-06 DOI: 10.4312/SLO2.0.2021.1.123-144
Dolores Lemmenmeier-Batinić
This paper describes the procedure of building a TEI-XML corpus of spoken Serbian starting from raw transcripts. The corpus consists of semi-structured interviews, which were gathered with the aim of investigating forms of address in Serbian. The interviews were thoroughly transcribed according to GAT transcription conventions. However, the transcription was carried out without tools that would check the validity of the GAT syntax or align the transcript with the audio recordings. In order to offer this resource to a broader audience, we resolved the inconsistencies in the original transcripts, normalised the semi-orthographic transcriptions and converted the corpus into a TEI format for transcriptions of speech. Further, we enriched the corpus by tagging and lemmatising the data. Lastly, we aligned the corpus turns to the corresponding audio segments by using a forced-alignment tool. In addition to presenting the main steps involved in converting the corpus to the XML format, this paper also discusses current challenges in the processing of spoken data, and the implications of data re-use regarding transcriptions of speech. This corpus can be used for studying Serbian from the perspective of interactional linguistics, for investigating morphosyntax, grammar, lexicon and phonetics of spoken Serbian, for studying disfluencies, as well as for testing models for automatic speech recognition and forced alignment. The corpus is freely available for research purposes.
Citations: 2
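As a rough illustration of the final conversion step described in the abstract above, the following Python sketch builds one lemmatised, time-anchored speaker turn as a TEI `<u>` (utterance) element. The corpus's actual encoding is not given in the abstract, so the speaker ID, the timeline anchors `#T0` and `#T1`, the sample tokens and the helper name `make_utterance` are all assumptions.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)


def make_utterance(speaker, start, end, tokens):
    """Build a TEI <u> element for one speaker turn.

    `start` and `end` are pointers to timeline anchors; `tokens` is a list
    of (word form, lemma) pairs that become <w lemma="..."> children.
    """
    u = ET.Element(
        f"{{{TEI_NS}}}u",
        {"who": f"#{speaker}", "start": start, "end": end},
    )
    for form, lemma in tokens:
        w = ET.SubElement(u, f"{{{TEI_NS}}}w", {"lemma": lemma})
        w.text = form
    return u


# Hypothetical turn: speaker SPK1 says "dobar dan" between anchors T0 and T1.
turn = make_utterance("SPK1", "#T0", "#T1", [("dobar", "dobar"), ("dan", "dan")])
xml_text = ET.tostring(turn, encoding="unicode")
```

In a full pipeline, the start and end anchors would come from the forced-alignment output and the lemmas from the tagger, with the resulting turns collected under a TEI `<body>`.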
Avtomatsko razpoznavanje slovenskega govora za dnevnoinformativne oddaje (Automatic recognition of Slovene speech for daily news broadcasts)
Pub Date : 2021-07-06 DOI: 10.4312/SLO2.0.2021.1.60-89
Lucija Gril, Mirjam Sepesy Maučec, Gregor Donaj, Andrej Žgank
Automatic speech recognition is one of the key building blocks of speech and language technologies. In this paper, we present the development of an automatic speech recogniser for Slovene in the domain of daily news broadcasts. The system architecture is based on deep neural networks. Taking the available speech resources into account, we modelled the system with different activation functions. During development, we also examined how lossy speech codecs affect recognition results. To train the recogniser, we used the UMB BNSI Broadcast News and IETK-TV databases, with a total of 66 hours of speech recordings. Alongside the deep neural networks, we enlarged the recognition vocabulary to 250,000 words, which reduced the out-of-vocabulary rate to 1.33%. On the test set, the best word error rate (WER) achieved was 15.17%. In evaluating the results, we also carried out a more detailed analysis of recognition errors based on lemmas and F-classes, which to some extent shows how demanding the Slovene language is for such technology use cases.
Citations: 0
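The two figures reported in the speech-recognition abstract above, word error rate (WER) and out-of-vocabulary rate, can be computed as in the following Python sketch. It uses the standard definitions (word-level edit distance divided by reference length, and the share of tokens missing from the vocabulary); the function names and sample strings are hypothetical, and the authors' actual scoring tools are not named in the abstract.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def oov_rate(tokens, vocabulary) -> float:
    """Share of tokens that do not appear in the recognition vocabulary."""
    tokens = list(tokens)
    if not tokens:
        raise ValueError("token list must not be empty")
    missing = sum(1 for token in tokens if token not in vocabulary)
    return missing / len(tokens)


# Hypothetical example: one substitution and one deletion in four words.
wer = word_error_rate("a b c d", "a x c")
oov = oov_rate(["a", "b", "c", "d"], {"a", "b", "c"})
```

Because WER counts insertions as errors, it can exceed 100% when the hypothesis is much longer than the reference.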
Učno E-okolje Slovenščina na dlani: izzivi in rešitve (The e-learning environment Slovenščina na dlani: challenges and solutions)
Pub Date : 2021-07-06 DOI: 10.4312/SLO2.0.2021.1.181-215
Darinka Verdonik, Simona Majhenič, Špela Antloga, Sandi Majninger, Marko Ferme, Kaja Dobrovoljc, Simona Pulko, Mira Krajnc Ivič, Natalija Ulčnik
The paper starts from three challenges we observe in Slovene classes in the upper grades of primary school and in secondary school: how to eliminate errors against the standard-language norm that persist in students' written work; how to improve phraseological competence; and how to improve communicative language competence. These challenges are central to the development of the modern e-learning environment Slovenščina na dlani ("Slovene in the palm of your hand"), which is based on language and information-communication technologies, supports flexible forms of teaching and distance teaching, eases the teacher's work, and also makes it possible to motivate students through gamification elements. We present the design and implementation of each of the e-environment's four content areas: orthography, grammar, phraseology and texts.
Citations: 0
Sign language lexicography: a case study of an online dictionary
Pub Date : 2021-07-06 DOI: 10.4312/SLO2.0.2021.1.90-122
Lucia Vlášková, Hana Strachoňová
As a growing field of study within sign language linguistics, sign language lexicography faces many challenges that have already been answered for audio-oral language material. In this paper, we present some of these challenges and the methods developed to help navigate the complex field of lexical classification. The described methods and strategies are implemented in the first Czech sign language (ČZJ) online dictionary, a part of the platform Dictio, developed at Masaryk University in Brno. We cover the topic of lemmatisation and how to decide what constitutes a lexeme in sign language. We introduce four types of expressions that qualify for a dictionary entry: a simple lexeme, a compound, a derivative, and a set phrase. We address the question of the place of classifier constructions and size and shape specifiers in a dictionary, given their peculiar semantic status. We maintain the standard classification of classifiers (whole entity and holding classifiers) and size and shape specifiers (SASSes; static and tracing specifiers). We provide arguments for separating the category of specifiers from the category of classifiers. We discuss the proper treatment of mouthings and mouth gestures concerning citation forms, derivation and translation. We show why it is difficult in sign language to distinguish synonyms from variants and how our proposed phonological criteria can help. We explain how to construct a semantic definition in a sign language and how to handle multiple meanings of one form. We offer simple guidelines for forming proper examples of use in a sign language. Finally, we briefly comment on the process of translation between sign and spoken languages. We conclude the paper with a summary of the roles that Dictio plays in the ČZJ-signing community.
Citations: 1
Journal
Slovenščina 2.0: empirical, applied and interdisciplinary research