Latest articles from Slovenščina 2.0: empirical, applied and interdisciplinary research

Učno E-okolje Slovenščina na dlani: izzivi in rešitve [The e-learning environment Slovenščina na dlani: challenges and solutions]
Pub Date : 2021-07-06 DOI: 10.4312/SLO2.0.2021.1.181-215
Darinka Verdonik, Simona Majhenič, Špela Antloga, Sandi Majninger, Marko Ferme, Kaja Dobrovoljc, Simona Pulko, Mira Krajnc Ivič, Natalija Ulčnik
This paper starts from three challenges we observe in teaching Slovenian in the upper grades of primary school and in secondary school: how to eliminate errors against the standard-language norm that persist in pupils' written work; how to improve phraseological competence; and how to improve communicative language competence. These challenges are the focal point in developing the modern e-learning environment Slovenščina na dlani, which is based on language technologies and information and communication technologies. It supports flexible forms of teaching and distance teaching, eases the teacher's workload, and also makes it possible to motivate pupils through gamification elements. In the paper we present the design and implementation of each of the environment's four content modules: orthography, grammar, phraseology and texts.
Citations: 0
Avtomatsko razpoznavanje slovenskega govora za dnevnoinformativne oddaje [Automatic recognition of Slovenian speech for daily news broadcasts]
Pub Date : 2021-07-06 DOI: 10.4312/SLO2.0.2021.1.60-89
Lucija Gril, Mirjam Sepesy Maučec, Gregor Donaj, Andrej Žgank
Automatic speech recognition is one of the key building blocks of speech and language technologies. In this paper we present the development of an automatic speech recogniser for Slovenian in the domain of daily news broadcasts. The system architecture is based on deep neural networks. Taking the available speech resources into account, we carried out modelling with different activation functions. During development we also examined how lossy speech codecs affect recognition results. To train the recogniser we used the UMB BNSI Broadcast News and IETK-TV databases, with 66 hours of speech recordings in total. Alongside the deep neural networks we enlarged the recognition vocabulary to 250,000 words, which reduced the out-of-vocabulary rate to 1.33%. On the test set we achieved a best word error rate (WER) of 15.17%. While evaluating the results we also carried out a more detailed analysis of recognition errors based on lemmas and F-classes, which to some extent shows how demanding Slovenian is for such technology use cases.
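The two figures quoted in this abstract, word error rate and out-of-vocabulary rate, have standard definitions that can be sketched in a few lines. The following is a minimal illustration only, not the authors' evaluation code: the function names, the whitespace tokenisation and the toy sentences are assumptions made for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()  # assumption: plain whitespace tokenisation
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def oov_rate(tokens, vocabulary) -> float:
    """Share of running words that are not in the recognition vocabulary."""
    tokens = list(tokens)
    if not tokens:
        raise ValueError("tokens must not be empty")
    return sum(1 for token in tokens if token not in vocabulary) / len(tokens)


# Hypothetical example: one deleted word ("je") and one wrong inflection.
print(word_error_rate("danes je lepo vreme", "danes lepo vremena"))  # 0.5
```

The toy sentence shows why an inflected language is demanding here: a wrong ending counts as a full word error, and every inflected form needs its own vocabulary entry.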
Citations: 0
Converting raw transcripts into an annotated and turn-aligned TEI-XML corpus: the example of the Corpus of Serbian Forms of Address
Pub Date : 2021-07-06 DOI: 10.4312/SLO2.0.2021.1.123-144
Dolores Lemmenmeier-Batinić
This paper describes the procedure of building a TEI-XML corpus of spoken Serbian starting from raw transcripts. The corpus consists of semi-structured interviews, which were gathered with the aim of investigating forms of address in Serbian. The interviews were thoroughly transcribed according to the GAT transcription conventions. However, the transcription was carried out without tools that would check the validity of the GAT syntax or align the transcript with the audio recordings. In order to offer this resource to a broader audience, we resolved the inconsistencies in the original transcripts, normalised the semi-orthographic transcriptions and converted the corpus into a TEI format for transcriptions of speech. Further, we enriched the corpus by tagging and lemmatising the data. Lastly, we aligned the corpus turns to the corresponding audio segments using a forced-alignment tool. In addition to presenting the main steps involved in converting the corpus to the XML format, this paper also discusses current challenges in the processing of spoken data, and the implications of data re-use regarding transcriptions of speech. This corpus can be used for studying Serbian from the perspective of interactional linguistics, for investigating morphosyntax, grammar, lexicon and phonetics of spoken Serbian, for studying disfluencies, as well as for testing models for automatic speech recognition and forced alignment. The corpus is freely available for research purposes.
Citations: 2
Nadgradnja Zgodovinarskega indeksa citiranosti [Upgrading the Historians' Citation Index]
Pub Date : 2021-07-06 DOI: 10.4312/SLO2.0.2021.1.216-235
Katja Meden, Ana Cvek
The Zgodovinarski indeks citiranja (Historians' Citation Index) dates back to 2003, when researchers at the Institute of Contemporary History (Inštitut za novejšo zgodovino) began tracking and systematically recording citations for project and programme applications to the Slovenian Research Agency (ARRS). The citation index went through several upgrades, attempts at data harmonisation and clean-ups of its relational databases, but in recent years it became clear that the system no longer met the needs of indexers and users. Before the upgrade we analysed the data to identify the biggest problems. The upgrade took place in two parts: first the administrative part, then the web application. In the course of the upgrade the index was technically modernised and designed to be intuitive for both indexers and users.
Citations: 0
Hedging modal adverbs in Slovenian academic discourse
Pub Date : 2021-07-06 DOI: 10.4312/SLO2.0.2021.1.145-180
Jakob Lenardic, Darja Fišer
This paper first presents a comparative analysis of modal adverbs in doctoral theses in the humanities and social sciences on the one hand, and in natural and technical sciences on the other, drawn from the 1.7-billion-token KAS corpus of Slovenian academic texts (Erjavec et al., 2019a). Using a randomized concordance analysis, we observe the epistemic and non-epistemic usage of the modal adverbs and show that epistemic adverbs are more characteristic of the humanities and social sciences theses. We also show that the non-epistemic dispositional meaning of possibility, which is most commonly used in natural and technical sciences theses, is not used as a hedging device. In the second part of the paper we compare the usage of a selected set of modals in bachelor’s, master’s and doctoral theses in order to chart how researchers’ approach to stance-taking changes at different proficiency levels in academic writing. We show that the observed increase in hedging devices in doctoral theses seems to be less a function of increased proficiency in academic writing as such and more the result of conceptual differences between undergraduate and postgraduate theses, only the latter of which are original research contributions with extensive discussion of the results.
Citations: 2
Tri spletne aplikacije o slovenskih narečjih [Three web applications on Slovenian dialects]
Pub Date : 2021-07-06 DOI: 10.4312/SLO2.0.2021.1.236-261
Rok Mrvič, Špela Zupančič
The need for a stronger online presence of dialect content and for its interactive multimedia presentation, above all of professionally designed dialectological resources and tools, prompted interdisciplinary cooperation between faculties of the University of Ljubljana, in particular the Faculty of Arts (FF) and the Faculty of Computer and Information Science (FRI). In 2017 and 2018 this cooperation bore fruit in the form of three freely accessible, open-source web applications on Slovenian dialects: Slovenski narečni atlas (Slovenian Dialect Atlas; SNA, 2017), Interaktivna karta slovenskih narečnih besedil (Interactive Map of Slovenian Dialect Texts; IKNB, 2018) and Slovar starega orodja v govoru Loškega Potoka (Dictionary of Old Tools in the Local Speech of Loški Potok; SSOLP, 2018). The first part of the article gives a general overview of Slovenian online dialectological resources and tools; the second presents in more detail the functionality of the three applications currently available to users. The discussion section highlights some of the circumstances in which the applications were created and the limitations that follow from them, and outlines possible solutions worth considering to ensure their long-term development.
Citations: 1
Sign language lexicography: a case study of an online dictionary
Pub Date : 2021-07-06 DOI: 10.4312/SLO2.0.2021.1.90-122
Lucia Vlášková, Hana Strachoňová
As a growing field of study within sign language linguistics, sign language lexicography faces many challenges that have already been answered for audio-oral language material. In this paper, we present some of these challenges and the methods developed to help navigate the complex field of lexical classification. The described methods and strategies are implemented in the first Czech sign language (ČZJ) online dictionary, a part of the platform Dictio, developed at Masaryk University in Brno. We cover the topic of lemmatisation and how to decide what constitutes a lexeme in sign language. We introduce four types of expressions that qualify for a dictionary entry: a simple lexeme, a compound, a derivative, and a set phrase. We address the question of the place of classifier constructions and shape and size specifiers in a dictionary, given their peculiar semantic status. We maintain the standard classification of classifiers (whole entity and holding classifiers) and size and shape specifiers (SASSes; static and tracing specifiers). We provide arguments for separating the category of specifiers from the category of classifiers. We discuss the proper treatment of mouthings and mouth gestures concerning citation forms, derivation and translation. We show why it is difficult in sign language to distinguish synonyms from variants and how our proposed phonological criteria can help. We explain how to construct a semantic definition in a sign language and how to handle multiple meanings of one form. We offer simple guidelines for forming proper examples of use in a sign language. Finally, we briefly comment on the process of translation between sign and spoken languages. We conclude the paper with a summary of the roles that Dictio plays in the ČZJ-signing community.
Citations: 1
Slovenščina 2.0: Language Technologies and Digital Humanities
Pub Date : 2021-07-06 DOI: 10.4312/SLO2.0.2021.1.I-VI
Darja Fišer, Tomaž Erjavec, Ajda Pretnar
Citations: 0
Cross-lingual transfer of sentiment classifiers
Pub Date : 2021-07-06 DOI: 10.4312/SLO2.0.2021.1.1-25
M. Robnik-Sikonja, Kristjan Reba, I. Mozetič
Word embeddings represent words in a numeric space so that semantic relations between words are represented as distances and directions in the vector space. Cross-lingual word embeddings transform vector spaces of different languages so that similar words are aligned. This is done by mapping one language’s vector space to the vector space of another language or by constructing a joint vector space for multiple languages. Cross-lingual embeddings can be used to transfer machine learning models between languages, thereby compensating for insufficient data in less-resourced languages. We use cross-lingual word embeddings to transfer machine learning prediction models for Twitter sentiment between 13 languages. We focus on two transfer mechanisms that have recently shown superior transfer performance. The first mechanism uses trained models whose input is the joint numerical space for many languages, as implemented in the LASER library. The second mechanism uses large pretrained multilingual BERT language models. Our experiments show that the transfer of models between similar languages is sensible, even with no target language data. The performance of cross-lingual models obtained with the multilingual BERT and the LASER library is comparable, and the differences are language-dependent. The transfer with CroSloEngual BERT, pretrained on only three languages, is superior on these and some closely related languages.
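The idea of mapping one language's vector space onto another's can be illustrated with a toy example. The sketch below is not the method used in the paper, which relies on LASER and multilingual BERT; it swaps in a much simpler technique, a least-squares rotation in two dimensions (the 2-D special case of orthogonal Procrustes alignment). All vectors, word pairs and function names are invented for illustration.

```python
import math


def fit_rotation(source, target):
    """Angle (radians) of the 2-D rotation that best maps source onto target
    in the least-squares sense: theta = atan2(sum of cross, sum of dot)."""
    dot = sum(sx * tx + sy * ty for (sx, sy), (tx, ty) in zip(source, target))
    cross = sum(sx * ty - sy * tx for (sx, sy), (tx, ty) in zip(source, target))
    return math.atan2(cross, dot)


def rotate(vec, angle):
    """Rotate a 2-D vector counter-clockwise by the given angle."""
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)


def cosine(u, v):
    """Cosine similarity of two non-zero 2-D vectors."""
    return (u[0] * v[0] + u[1] * v[1]) / (math.hypot(*u) * math.hypot(*v))


def nearest(vec, space):
    """Word in `space` whose vector is most similar to `vec`."""
    return max(space, key=lambda word: cosine(vec, space[word]))


# Hypothetical embeddings: the English space is the Slovenian one rotated by 90 degrees.
slovenian = {"pes": (1.0, 0.0), "mačka": (0.8, 0.6), "hiša": (0.0, 1.0)}
english = {"dog": (0.0, 1.0), "cat": (-0.6, 0.8), "house": (-1.0, 0.0)}

# Fit the mapping on a small seed dictionary, then apply it to a held-out word.
seed = [("pes", "dog"), ("hiša", "house")]
angle = fit_rotation([slovenian[s] for s, _ in seed], [english[e] for _, e in seed])
translated = nearest(rotate(slovenian["mačka"], angle), english)
print(translated)  # cat
```

Once the two spaces are aligned like this, a classifier trained on vectors from one language can in principle be applied to vectors from the other, which is the transfer setting the abstract describes.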
Citations: 5
Okrogla miza »(Bližnja) srečanja oblikovalcev jezikovne politike« [Round table "(Close) encounters of language policy makers"]
Pub Date : 2020-12-21 DOI: 10.4312/slo2.0.2020.1.92-112
I. Ferbežar, Igor Cetina, Alojz Ihan, Marko Stabej, Lana Zdravković, Tina Zupančič
V Ljubljani sta med 6. in 8. 11. 2019 potekala 54. srečanje in javni posvet ALTE (Association of Language Testers in Europe). Srečanje na temo Enojezično testiranje v večjezični realnosti: jezikovne ideologije in njihov vpliv na jezikovno testiranje sta organizirala Univerza v Ljubljani, Filozofska fakulteta in njen Center za slovenščino kot drugi in tuji jezik pri Oddelku za slovenistiko. V tem okviru je 8. 11. 2019 potekala okrogla miza (Bližnja) srečanja oblikovalcev jezikovne politike. Objavljamo zapis posnetka pogovora sodelujočih na dogodku.
Citations: 0