
Slovenscina 2.0: Latest Articles

DirKorp
Q2 Arts and Humanities Pub Date : 2023-09-12 DOI: 10.4312/slo2.0.2023.1.189-217
Petra Bago, Virna Karlić
In this paper, we present recent developments on a new version (v3.0) of DirKorp (Korpus direktivnih govornih činova hrvatskoga jezika), the first Croatian corpus of directive speech acts developed for the purposes of pragmatic research. The corpus contains 800 elicited speech acts collected via an online questionnaire with role-playing tasks, a method of simulated communication that is implemented under pre-set conditions. This method is suitable for researching speech acts due to the ability to collect a great number of examples of such acts of equal propositional content and illocutionary purpose used in the same controlled situations. The presented situations are classified into two categories with regard to the relationship between the participants of the communication act: (1) situations involving interlocutors who are not in a familiar relationship; (2) situations involving interlocutors in a familiar relationship. Assignments of the two categories are organized into four pairs, asking respondents to share a speech act of similar propositional content. The respondents were 100 Croatian speakers, all undergraduate (63%) or graduate students (37%) of the Faculty of Humanities and Social Sciences (University of Zagreb). The corpus has been manually annotated on the speech act level, each speech act containing up to 14 features: (1) respondent ID, (2) familiarity/unfamiliarity, (3) utterance type, (4) directive performative verb in 1st person, (5) illocutionary force, (6) propositional content, (7) T/V form, (8) exhortative, (9) lexical marker of request, (10) lexical marker of apology, (11) lexical marker of gratitude, (12) honorific title, (13) grammatical mood, and (14) modal verb in 2nd person. It contains 12,676 tokens and 1,692 types. The corpus is encoded according to the TEI P5: Guidelines for Electronic Text Encoding and Interchange, developed and maintained by the Text Encoding Initiative Consortium (TEI). 
DirKorp is available for download under the CC BY-SA 4.0 license from GitHub in TEI format. We describe applied pragmatic annotation as well as the structure of the corpus.
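The token and type counts reported in the abstract (12,676 tokens, 1,692 types) imply a type-token ratio of about 0.13. The sketch below shows the arithmetic and a naive way to obtain such counts; the whitespace tokenizer and case folding are assumptions made for illustration and are not the authors' tokenization procedure.

```python
from typing import Iterable, Tuple


def count_types_tokens(utterances: Iterable[str]) -> Tuple[int, int]:
    """Return (types, tokens) using naive whitespace tokenization (an assumption)."""
    tokens = [word.lower() for utterance in utterances for word in utterance.split()]
    return len(set(tokens)), len(tokens)


def type_token_ratio(n_types: int, n_tokens: int) -> float:
    """Type-token ratio: distinct word forms divided by running words."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return n_types / n_tokens


# Figures taken from the abstract: 1,692 types over 12,676 tokens.
ttr = type_token_ratio(1692, 12676)
print(f"type-token ratio: {ttr:.3f}")
```

A low ratio of this kind is what one might expect from elicited data in which many respondents answer the same eight prompts, since much of the vocabulary repeats across responses.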
Citations: 0
Negativno zaznamovano besedišče v Slovarju sopomenk sodobne slovenščine 2.0 (Negatively Marked Vocabulary in the Thesaurus of Modern Slovene 2.0)
Q2 Arts and Humanities Pub Date : 2023-09-12 DOI: 10.4312/slo2.0.2023.1.8-32
Špela Arhar Holdt, Iztok Kosem, Eva Pori, Vojko Gorjanc, Simon Krek, Polona Gantar
In this paper we present solutions for identifying and labelling marked vocabulary within the concept of the responsive Thesaurus of Modern Slovene (Slovar sopomenk sodobne slovenščine). As this is the first project of its kind, the solutions are largely innovative and are situated within the issues of automatic, machine-driven dictionary compilation, the dictionary's openness, and the involvement of the user community. The paper describes the procedure for identifying hateful and vulgar vocabulary and for assigning labels, warning icons, and longer explanations. We address both technical and content-related questions of labelling. In terms of content, the labels are based on communicative intent and effect, their essence being information about the possible consequences of use; in the technical solutions, we pay close attention to the digital medium and to how the solutions are visualised in it. Because responsiveness is one of the dictionary's key concepts, we recognise the importance of cooperating with the user community when designing labelling solutions, and therefore also propose ways of involving the community in adding labels. The original conference paper has been expanded in all sections, and an entirely new section has been added on the treatment of polysemous headwords, their sense division, and sense description, with examples of senses that are negatively marked.
Citations: 0
Identifikacija metafore in metonimije v jezikovnih korpusih (Identifying Metaphor and Metonymy in Language Corpora)
Q2 Arts and Humanities Pub Date : 2023-09-12 DOI: 10.4312/slo2.0.2023.1.91-117
Špela Antloga
We are not always able to put everything we think directly into words, so we use various linguistic-cognitive procedures, including metaphor and metonymy, to explain phenomena. Recognition of the value and prevalence of metaphorical and metonymic expressions in language has, over the past twenty years, led to increased interest in the systematic identification and extraction of such figurative expressions from corpora of individual languages. Expressions involving the conceptual mappings that take part in metaphorical and metonymic processes are difficult to extract from corpora that are not specifically annotated for research on figurative language. In this article I define conceptual metaphor and conceptual metonymy, present the most common methods for extracting metaphorical and metonymic expressions from language corpora, and, using the g-KOMET corpus, which is manually annotated for metaphorical expressions and metonymic transfers, illustrate an attempt to systematise some of the most frequent metonymic transfers in spoken Slovene.
Citations: 0
Uvodnik v tematsko številko o Digitalnem jezikoslovju (Editorial to the Thematic Issue on Digital Linguistics)
Q2 Arts and Humanities Pub Date : 2023-09-12 DOI: 10.4312/slo2.0.2023.1.1-6
Darja Fišer, Tomaž Erjavec
This thematic issue of the journal Slovenščina 2.0 is devoted to digital linguistics, a rapidly growing interdisciplinary research field at the intersection of traditional linguistics, information technology, and the social sciences. At the forefront of digital linguistics research are the preservation, analysis, and use of language data, that is, digital artefacts with language as the carrier of interpersonal communication. Both at home and around the world, digital linguistics is becoming increasingly important not only in academic and educational circles but also in the public and private sectors, which increasingly need experts skilled in managing digital language data in order to operate successfully in modern society and the economy.
Citations: 0
Universal Dependencies za slovenščino (Universal Dependencies for Slovene)
Q2 Arts and Humanities Pub Date : 2023-09-12 DOI: 10.4312/slo2.0.2023.1.218-246
Kaja Dobrovoljc, Luka Terčon, Nikola Ljubešić
Universal Dependencies (UD) is an internationally harmonised annotation scheme for cross-linguistically comparable morphological and syntactic annotation of texts based on the principles of dependency grammar; alongside more than 130 other languages of the world, it has also been successfully applied to the annotation of Slovene texts. In this paper we present the results of recent UD-related activities within the project Development of Slovene in a Digital Environment (Razvoj slovenščine v digitalnem okolju), in which we upgraded the existing infrastructure by revising and thoroughly documenting the UD annotation guidelines for Slovene, extending the SSJ-UD treebank of written Slovene with new sentences from the ssj500k and ELEXIS-WSD corpora, creating a test set from texts of the SentiCoref corpus for the SloBENCH web portal, and semi-automatically converting the morphological tags of the reference training corpora SUK and Janes-Tag. A new predictive model for syntactic parsing in the CLASSLA-Stanza tool was also trained on the extended SSJ-UD treebank; to support further linguistic applications, we evaluate it in more detail in terms of overall parsing accuracy and the most frequent types of errors.
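UD treebanks are commonly distributed in the tab-separated CoNLL-U format, in which each token line carries ten fields. As a minimal sketch of what dependency annotation looks like in practice, the following reads one sentence and finds its root. The sample sentence and its annotation are hand-written for illustration and are not taken from SSJ-UD; the function name is likewise hypothetical.

```python
from typing import Dict, List

# The ten CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos", "feats", "head", "deprel", "deps", "misc"]


def parse_conllu_sentence(block: str) -> List[Dict[str, str]]:
    """Parse one CoNLL-U sentence block into a list of token dictionaries."""
    rows = []
    for line in block.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}")
        rows.append(dict(zip(FIELDS, cols)))
    return rows


# Hand-written toy sentence ("Pes laja." = "The dog barks."), not corpus data.
SAMPLE = "\n".join([
    "# text = Pes laja.",
    "\t".join(["1", "Pes", "pes", "NOUN", "_", "_", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "laja", "lajati", "VERB", "_", "_", "0", "root", "_", "SpaceAfter=No"]),
    "\t".join(["3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
])

tokens = parse_conllu_sentence(SAMPLE)
root = next(t for t in tokens if t["head"] == "0")  # head 0 marks the sentence root
print(root["form"], [t["deprel"] for t in tokens])
```

This sketch ignores multiword-token ranges and empty nodes, which real treebank files may contain; a dedicated CoNLL-U library would be the safer choice for actual corpus work.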
Citations: 0
Adapting an English Corpus and a Question Answering System for Slovene
Q2 Arts and Humanities Pub Date : 2023-09-12 DOI: 10.4312/slo2.0.2023.1.247-274
Uroš Šmajdek, Matjaž Zupanič, Maj Zirkelbach, Meta Jazbinšek
Developing effective question answering (QA) models for less-resourced languages like Slovene is challenging due to the lack of proper training data. Modern machine translation tools can address this issue, but this presents another challenge: the given answers must be found in their exact form within the given context since the model is trained to locate answers and not generate them. To address this challenge, we propose a method that embeds the answers within the context before translation and evaluate its effectiveness on the SQuAD 2.0 dataset translated using both eTranslation and Google Cloud translator. The results show that by employing our method we can reduce the rate at which answers were not found in the context from 56% to 7%. We then assess the translated datasets using various transformer-based QA models, examining the differences between the datasets and model configurations. To ensure that our models produce realistic results, we test them on a small subset of the original data that was human-translated. The results indicate that the primary advantages of using machine-translated data lie in refining smaller multilingual and monolingual models. For instance, the multilingual CroSloEngual BERT model fine-tuned and tested on Slovene data achieved nearly equivalent performance to one fine-tuned and tested on English data, with 70.2% and 73.3% questions answered, respectively. While larger models, such as RemBERT, achieved comparable results, correctly answering questions in 77.9% of cases when fine-tuned and tested on Slovene compared to 81.1% on English, fine-tuning with English and testing with Slovene data also yielded similar performance.
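The idea of embedding the answer in the context before translation can be sketched as follows. This is only an illustration under stated assumptions: the bracket markers, the function names, and the toy translator are hypothetical, and the abstract does not specify how the authors actually mark or recover answers.

```python
from typing import Callable, Tuple

# Hypothetical delimiters; the paper's actual marking scheme is not given in the abstract.
OPEN, CLOSE = "[[", "]]"


def embed_answer(context: str, answer: str, answer_start: int) -> str:
    """Wrap the answer span in markers so it can be located after translation."""
    end = answer_start + len(answer)
    if context[answer_start:end] != answer:
        raise ValueError("answer_start does not point at the answer")
    return context[:answer_start] + OPEN + answer + CLOSE + context[end:]


def recover_answer(translated: str) -> Tuple[str, str, int]:
    """Strip the markers and return (clean_context, answer, answer_start)."""
    start = translated.find(OPEN)
    end = translated.find(CLOSE, start + len(OPEN)) if start != -1 else -1
    if start == -1 or end == -1:
        raise ValueError("markers were lost in translation")
    answer = translated[start + len(OPEN):end]
    clean = translated[:start] + answer + translated[end + len(CLOSE):]
    return clean, answer, start


def translate_example(context: str, answer: str, answer_start: int,
                      translate: Callable[[str], str]) -> Tuple[str, str, int]:
    """Mark the answer, translate the marked context, then recover the answer span."""
    return recover_answer(translate(embed_answer(context, answer, answer_start)))


# Toy stand-in for a machine translation service (an assumption, not a real API).
def toy_translate(text: str) -> str:
    return text.replace("The capital of Slovenia is", "Glavno mesto Slovenije je")


ctx = "The capital of Slovenia is Ljubljana."
clean, ans, pos = translate_example(ctx, "Ljubljana", ctx.index("Ljubljana"), toy_translate)
print(clean, ans, pos)
```

Because the answer is translated inside its context, the recovered span is guaranteed to occur verbatim in the translated passage, which is what an extractive QA model needs. The sketch assumes the translator preserves the markers; a real pipeline would need a fallback for examples where they are dropped or altered.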
Citations: 1
Grammatical and Pragmatic Aspects of Slovenian Modality in Socially Unacceptable Facebook Comments
Q2 Arts and Humanities Pub Date : 2023-09-12 DOI: 10.4312/slo2.0.2023.1.33-68
Jakob Lenardič, Kristina Pahor de Maiti
This paper investigates the grammatical and pragmatic uses of epistemic and deontic modal expressions in a corpus of Slovenian socially acceptable and unacceptable Facebook comments. We propose a set of modals that do not interpretatively vary in their modality type in order to enable robust corpus searches and reliable quantification of the results. We show that deontic, but not epistemic, modals are significantly more frequent in socially unacceptable comments, and specifically that they favour violent discourse. We complement the quantitative findings with a qualitative analysis of the discursive roles played by the modals. We explore how pragmatic communicative strategies such as hedging, boosting, and face-saving arise from the underlying syntactic and semantic properties of the modal expressions, such as the modal force and clausal syntax.
Citations: 0
Govoriš nevronsko? (Do You Speak Neural?)
Q2 Arts and Humanities Pub Date : 2023-09-12 DOI: 10.4312/slo2.0.2023.1.138-159
David Bordon
The aim of this paper is to present a study testing the comprehensibility of unrevised machine-translated web texts. The primary participants in the study were general readers rather than trained translators or post-editors of machine translation. This is the first study of its kind carried out for Slovene. The goal was to determine the extent to which unrevised machine translations are comprehensible to a general readership, and I also examined the influence of textual and visual context. I tested translations produced by Google Translate and eTranslation. The study was conducted as a survey in which participants answered questions testing their comprehension of an accompanying text segment that contained an error. The results offer insight into the current stage of development of machine translation systems, not from the perspective of productivity in post-editing their output, but from the perspective of how well the target readership understands them. At the end of the article I provide a new evaluation of the source segments, which I retranslated at the beginning of 2023, this time also with the DeepL translator.
Citations: 0
Named Entities in Modernist Literary Texts
Q2 Arts and Humanities Pub Date : 2023-09-12 DOI: 10.4312/slo2.0.2023.1.118-137
Andrejka Žejn, Mojca Šorli
This paper is a follow-up and elaboration of the paper published in the JTDH 2022 Conference Proceedings on manual semantic annotation of named entities based on a proposed set of annotations for a corpus of modernist literary texts. We first briefly describe the corpus and introduce the annotation scheme, then focus on the results of additional analyses, and conclude with further challenges and issues we identified with respect to established NER systems and practices of related projects. Overall, we identify several categories of proper names, foreign language elements, and bibliographic citations, but focus here on the challenges of annotating names of literary characters and place names, and provide examples of the results of preliminary analyses of these entities in the corpus.
本文是在JTDH 2022会议论文集上发表的关于基于现代主义文学文本语料库的一组拟议注释的命名实体的手动语义注释的论文的后续和详细阐述。我们首先简要描述了语料库并介绍了注释方案,然后重点介绍了其他分析的结果,并总结了我们在已建立的NER系统和相关项目的实践中发现的进一步挑战和问题。总的来说,我们确定了专有名称、外语元素和书目引文的几类,但这里重点关注文学人物和地名的注释挑战,并提供了语料库中这些实体的初步分析结果的示例。
Citations: 0
Referencing the Public by Populist and Non-populist Parties in the Slovene Parliament
Q2 Arts and Humanities Pub Date : 2023-09-12 DOI: 10.4312/slo2.0.2023.1.69-90
Darja Fišer, Tjaša Konovšek, Andrej Pančur
The present moment raises many questions about the workings and resilience of parliamentary democracy in Western-type democracies, including the former socialist states of the East Central European region, where various forms of populism and illiberal democracy are taking shape. Among these, Slovenia is taken as a case study, since it is not only a former socialist state, but was also for a long time acknowledged as a post-socialist success story. Focusing on the central state institution in systems of parliamentary democracy, i.e. the parliament, and its members (MPs), this paper considers speech as performed during parliamentary sessions by MPs from populist and non-populist political parties between the years 1992 and 2018, the period of a fully democratic Slovene national parliament. It combines the methodological approaches of cultural history with corpus linguistics in order to map any possible differences in populist and non-populist discourse of MPs. Special attention is given to situations where MPs mentioned the public, thus testing the hypothesis that populist MPs engage more with the public as a part of their populist political style.
Citations: 0