Seimo posėdžių stenogramų tekstynas autorystės nustatymo bei autoriaus profilio sudarymo tyrimams (A corpus of Seimas session transcripts for authorship attribution and author profiling research)

Kalbotyra, vol. 66, pp. 27-45. Pub Date: 2016-03-30. DOI: 10.15388/KLBT.2014.7674
Jurgita Kapočiūtė-Dzikienė, Andrius Utka, Ligita Šarkutė
Citations: 2

Abstract

We present a corpus of transcribed Lithuanian parliamentary speeches, prepared in a format suited to a range of authorship identification tasks. The corpus consists of approximately 111 thousand texts (24 million words). Each text corresponds to one parliamentary speech delivered during an ordinary session within the 7 parliamentary terms between March 10, 1990 and December 23, 2013. The texts are grouped into 147 categories corresponding to individual authors, so they can be used for authorship attribution; they are also grouped by age, gender and political views, which makes them suitable for author profiling. Because short texts make an author's speaking style harder to recognize and are easily confused with the style of other authors, we included only texts of at least 100 words. To make each category as comprehensive and representative as possible, we included only authors who delivered at least 200 speeches. All texts are lemmatized, morphologically and syntactically annotated, and tokenized into character n-grams. Statistical information about the corpus is also available. We also demonstrate that the corpus can be used effectively for authorship attribution and author profiling with supervised machine learning methods. Its structure also allows use with unsupervised machine learning methods, for building rule-based methods, and in various linguistic analyses.
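The two inclusion thresholds and the character n-gram tokenization described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline. The function names, the whitespace-based word count, the order in which the two filters are applied, and the default n = 3 are all assumptions made for illustration; only the thresholds of 100 words and 200 speeches come from the abstract.

```python
from collections import Counter

MIN_WORDS = 100     # minimum words per text (from the abstract)
MIN_SPEECHES = 200  # minimum speeches per author (from the abstract)


def filter_corpus(speeches, min_words=MIN_WORDS, min_speeches=MIN_SPEECHES):
    """Keep (author, text) pairs that satisfy both thresholds.

    Assumptions (not stated in the abstract): words are counted by
    splitting on whitespace, and the length filter is applied before
    speeches are counted per author.
    """
    # Step 1: drop texts that are too short.
    long_enough = [(author, text) for author, text in speeches
                   if len(text.split()) >= min_words]
    # Step 2: count the remaining speeches per author.
    counts = Counter(author for author, _ in long_enough)
    # Step 3: keep only authors with enough speeches.
    return [(author, text) for author, text in long_enough
            if counts[author] >= min_speeches]


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]
```

With toy thresholds, `filter_corpus(data, min_words=3, min_speeches=2)` keeps only authors who have at least two texts of three or more words, and `char_ngrams("Seimas", 3)` yields the four trigrams of the word. Counting speeches only after the length filter is the stricter reading of the criteria; the opposite order would admit more authors.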