Seimo posėdžių stenogramų tekstynas autorystės nustatymo bei autoriaus profilio sudarymo tyrimams (A corpus of Seimas session transcripts for authorship attribution and author profiling research)

Kalbotyra. Pub Date: 2016-03-30. DOI: 10.15388/KLBT.2014.7674

Jurgita Kapočiūtė-Dzikienė, Andrius Utka, Ligita Šarkutė

Volume 66, pages 27-45. Publication type: Journal Article.

Citations: 2

Abstract

This paper presents a corpus of transcribed Lithuanian parliamentary speeches, prepared in a format suited to a range of authorship identification tasks. The corpus contains approximately 111 thousand texts (24 million words). Each text corresponds to one parliamentary speech delivered during an ordinary session within 7 parliamentary terms, from March 10, 1990 to December 23, 2013. The texts are grouped into 147 categories, one per individual author, so they can be used for authorship attribution tasks; they are also grouped by age, gender and political views, which makes them suitable for author profiling tasks. Because short texts make an author's speaking style harder to recognize and are easily confused with the style of other authors, only texts of at least 100 words were included in the corpus. To make each category as comprehensive and representative as possible, only authors who delivered at least 200 speeches were included. All texts are lemmatized, morphologically and syntactically annotated, and tokenized into character n-grams. Statistical information about the corpus is also available. We also demonstrate that the corpus can be used effectively in authorship attribution and author profiling tasks with supervised machine learning methods. The corpus structure also supports unsupervised machine learning methods, the creation of rule-based methods, and various linguistic analyses.
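The two inclusion criteria and the character n-gram tokenization described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, the whitespace-based word count, the order in which the two filters are applied, and the default n-gram length of 3 are all assumptions made for illustration; only the thresholds of 100 words and 200 speeches come from the abstract.

```python
from collections import Counter

MIN_WORDS = 100     # minimum speech length stated in the abstract
MIN_SPEECHES = 200  # minimum number of speeches per author stated in the abstract


def filter_corpus(speeches, min_words=MIN_WORDS, min_speeches=MIN_SPEECHES):
    """Keep (author, text) pairs that satisfy both thresholds.

    Assumptions (not stated in the abstract): words are counted by
    whitespace splitting, and the length filter is applied before
    speeches are counted per author.
    """
    long_enough = [(a, t) for a, t in speeches if len(t.split()) >= min_words]
    counts = Counter(a for a, _ in long_enough)
    return [(a, t) for a, t in long_enough if counts[a] >= min_speeches]


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


if __name__ == "__main__":
    # Toy data with lowered thresholds so the effect is visible.
    toy = [
        ("A", "one two three"),
        ("A", "four five six"),
        ("A", "short"),
        ("B", "seven eight nine"),
    ]
    print(filter_corpus(toy, min_words=3, min_speeches=2))
    print(char_ngrams("Seimas", 3))
```

With the toy thresholds, author B is dropped for having too few speeches and the one-word speech by author A is dropped for being too short. The same functions apply unchanged with the default thresholds from the abstract.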


