Title: Subjectivity Lexicon for Czech: Implementation and Improvements
Authors: Katerina Veselovská, Jan Hajic, J. Šindlerová
Published: 2014-07-01, J. Lang. Technol. Comput. Linguistics, DOI: 10.21248/jlcl.29.2014.183

Abstract: The aim of this paper is to introduce the Czech subjectivity lexicon, a new lexical resource for sentiment analysis in Czech. We describe the particular stages of the manual refinement of the lexicon and demonstrate its use in state-of-the-art polarity classifiers, namely the Maximum Entropy classifier. We test the success rate of the system enriched with the dictionary on different data sets, compare the results, and suggest further improvements to the lexicon-based classification system.
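The lexicon-driven setup the abstract describes can be sketched in miniature. This is a toy illustration, not the authors' system: the Czech lexicon entries, feature names, and the majority-vote decision rule are all invented for the example; in the paper's setting, a Maximum Entropy classifier would learn weights for features like these rather than apply a fixed rule.

```python
# Hypothetical miniature subjectivity lexicon: word -> prior polarity.
# Entries are invented examples, not taken from the Czech lexicon itself.
LEXICON = {"dobrý": 1, "skvělý": 1, "špatný": -1, "hrozný": -1}

def lexicon_features(tokens):
    """Count positive and negative lexicon hits in a tokenized text.

    In a MaxEnt classifier these counts would be two features among many
    (alongside e.g. n-grams), each with a learned weight.
    """
    pos = sum(1 for t in tokens if LEXICON.get(t.lower(), 0) > 0)
    neg = sum(1 for t in tokens if LEXICON.get(t.lower(), 0) < 0)
    return {"pos_hits": pos, "neg_hits": neg}

def baseline_polarity(tokens):
    """Majority-vote baseline over lexicon hits (a stand-in for the
    trained classifier, for illustration only)."""
    f = lexicon_features(tokens)
    if f["pos_hits"] > f["neg_hits"]:
        return "positive"
    if f["neg_hits"] > f["pos_hits"]:
        return "negative"
    return "neutral"
```

A rule like this is a common baseline against which lexicon-enriched statistical classifiers are compared.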
Title: Variability in Dutch Tweets. An estimate of the proportion of deviant word tokens
Authors: H. V. Halteren, N. Oostdijk
Published: 2014-07-01, J. Lang. Technol. Comput. Linguistics, DOI: 10.21248/jlcl.29.2014.191

Abstract: In this paper, we attempt to estimate what proportion of the word tokens in Dutch tweets are not covered by standard resources and can therefore be expected to cause problems for standard NLP applications. We fully annotated and analysed a small pilot corpus. We also used the corpus to calibrate automatic estimation procedures for the proportions of non-word tokens and of out-of-vocabulary words, after which we applied these procedures to about 2 billion Dutch tweets. We find that the proportion of possibly problematic tokens is so high (e.g. an estimate of 15% of the words being problematic in the full tweet collection, and the annotated sample of death-threat-related tweets showing problematic words in three out of four tweets) that any NLP application designed for standard Dutch can be expected to be seriously hampered in its processing. We suggest a few approaches to alleviate the problem, but none of them will solve it completely.
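The abstract's distinction between non-word tokens (URLs, mentions, hashtags) and out-of-vocabulary words can be sketched as a coverage check against a standard lexicon. This is a hedged illustration, not the paper's calibrated estimation procedure: the tiny vocabulary, the regular expression, and the punctuation stripping are all simplifying assumptions.

```python
import re

# Stand-in for a real standard-Dutch lexicon (a real one has ~10^5+ entries).
STANDARD_VOCAB = {"ik", "heb", "een", "mooi", "huis"}

def tweet_coverage(text):
    """Classify the tokens of one tweet into non-words (URLs, @mentions,
    #hashtags) and out-of-vocabulary words; everything else is covered.

    The non-word pattern is a deliberately crude approximation.
    """
    tokens = text.split()
    nonword = [t for t in tokens if re.match(r"^(https?://|[@#])", t)]
    words = [t for t in tokens if t not in nonword]
    oov = [w for w in words if w.strip(".,!?").lower() not in STANDARD_VOCAB]
    return {"tokens": len(tokens), "nonword": len(nonword), "oov": len(oov)}
```

Aggregating such per-tweet counts over a large collection gives the kind of proportion estimate the paper calibrates against its hand-annotated pilot corpus.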
Title: Challenges of building a CMC corpus for analyzing writer's style by age: The DiDi project
Authors: A. Glaznieks, Egon W. Stemle
Published: 2014-07-01, J. Lang. Technol. Comput. Linguistics, DOI: 10.21248/jlcl.29.2014.188

Abstract: This paper introduces the project DiDi, in which we collect and analyze German data of computer-mediated communication (CMC) written by internet users from the Italian province of Bolzano – South Tyrol. The project focuses on quasi-public and private messages posted on Facebook, and analyses how L1 German speakers in South Tyrol use different varieties of German (e.g. South Tyrolean dialect vs. Standard German) and other languages (esp. Italian) to communicate on social network sites. A particular interest of the study is the writers' age. We assume that users of different age groups can be distinguished by their linguistic behavior. Our comprehension of age is based on two conceptions: a person's regular numerical age and her/his digital age, i.e. the number of years a person has been actively involved in using new media. The paper describes the project as well as its diverse challenges and problems of data collection and corpus building. Finally, we also discuss possible ways in which these challenges can be met.

1 Language in computer-mediated communication

There is a wealth of studies in the corpus-linguistic literature on the particularities of language used in computer-mediated communication (CMC) (e.g., for German, Bader 2002, Demuth and Schulz 2010, Dürscheid et al. 2010, Günthner and Schmidt 2002, Harvelid 2007, Kessler 2008, Kleinberger Günther and Spiegel 2006, Siebenhaar 2006, Siever 2005, Salomonsson 2011). In particular, the use of "netspeak" phenomena (Crystal 2001) such as emoticons, acronyms and abbreviations, interaction words, and iteration of letters has attracted attention. The studies describe different functions of such phenomena within CMC. Features transferred from spoken language, such as discourse particles and vernacular and dialectal expressions, are frequently mentioned characteristics of CMC. They serve to convey the informality of a given message, comment, or status post. Writers often use emoticons, interaction words (e.g. *grin*), abbreviations (e.g. lol), and spelling changes such as the iteration of letters (e.g. coooooll) to compensate for the absence of facial expressions, gestures and other kinesic features, and prosody. Many emoticons, interaction words, and abbreviations are "verbal glosses" for performed actions and aspects of specific situations. In addition, there are particularities in spelling that people use without the aim of representing features of spoken language and that deviate from the standard variety. To cover such phenomena (e.g. n8 for 'night'), we follow Androutsopoulos (2007; 2011) and use the term "graphostylistics". Finally, all forms of shortening (e.g. lol, n8, and thx for thanks) are often used for reasons of economy, to carry out speedy conversations in chats and instant messages. The use of shortenings can also be motivated by character restrictions of the services used. Differences between the use of language in CMC and in traditional written genres have often been described with reference to the model of Koch/Oesterreicher (1985; 2008).
Title: The notion of importance in academic writing: detection, linguistic properties and targets
Authors: Stefania Degaetano-Ortlieb, Hannah Kermes, E. Teich
Published: 2014-07-01, J. Lang. Technol. Comput. Linguistics, DOI: 10.21248/jlcl.29.2014.184

Abstract: We present a semi-automatic approach to studying expressions of evaluation in academic writing as well as the targets evaluated. The aim is to uncover the linguistic properties of evaluative expressions used in this genre, i.e. to investigate which lexico-grammatical patterns are used to attribute an evaluation to a target. The approach encompasses pattern detection and the semi-automatic annotation of the patterns in the SciTex Corpus (Teich and Fankhauser, 2010; Degaetano-Ortlieb et al., 2013). We exemplify the procedures by investigating the notion of importance expressed in academic writing. By extracting distributional information provided by the annotation, we analyze how this notion might differ across academic disciplines and sections of research articles.
Title: The CoMeRe corpus for French: structuring and annotating heterogeneous CMC genres
Authors: T. Chanier, Céline Poudat, Benoît Sagot, G. Antoniadis, Ciara R. Wigham, Linda Hriba, Julien Longhi, Djamé Seddah
Published: 2014-01-11, J. Lang. Technol. Comput. Linguistics, DOI: 10.21248/jlcl.29.2014.187

Abstract: The CoMeRe project aims to build a kernel corpus of different Computer-Mediated Communication (CMC) genres with interactions in French as the main language, by assembling interactions stemming from networks such as the Internet or telecommunication, covering mono- and multimodal as well as synchronous and asynchronous communication. Corpora are assembled using a standard, the TEI (Text Encoding Initiative) format. This implies extending, through a European endeavor, the TEI model of text in order to encompass the richest and most complex CMC genres. This paper presents the Interaction Space model. We explain how this model has been encoded within the TEI corpus header and body. The model is then instantiated through the first four corpora we have processed: three corpora where interactions occurred in single-modality environments (text chat, or SMS systems) and a fourth corpus where text chat, email and forum modalities were used simultaneously. The CoMeRe project has two main research perspectives: Discourse Analysis, only alluded to in this paper, and the linguistic study of idiolects occurring in different CMC genres. As NLP algorithms are an indispensable prerequisite for such research, we present our motivations for applying an automatic annotation process to the CoMeRe corpora. Our wish to guarantee generic annotations meant we did not consider any processing beyond morphosyntactic labelling, but prioritized the automatic annotation of any freely variant elements within the corpora. We then turn to decisions made concerning which annotations to make for which units, and describe the processing pipeline for adding these. All CoMeRe corpora are verified through a staged quality-control process, designed to allow corpora to move from one project phase to the next. Public release of the CoMeRe corpora is a short-term goal: the corpora will be integrated into the forthcoming French National Reference Corpus and disseminated through the national linguistic infrastructure ORTOLANG. We therefore highlight issues and decisions made concerning the OpenData perspective.
Title: Optimierung des Stuttgart-Tübingen-Tagset für die linguistische Annotation von Korpora zur internetbasierten Kommunikation: Phänomene, Herausforderungen, Erweiterungsvorschläge (Optimizing the Stuttgart-Tübingen Tagset for the linguistic annotation of corpora of internet-based communication: phenomena, challenges, proposed extensions)
Authors: Thomas Bartz, Michael Beißwenger, Angelika Storrer
Published: 2013-07-01, J. Lang. Technol. Comput. Linguistics, DOI: 10.21248/jlcl.28.2013.172

Abstract (fragment, translated from German): "[…] empirically shown (cf. e.g. […]) […] constitute, alongside representation standards, a further […]"
Title: STTS goes Kiez - Experiments on Annotating and Tagging Urban Youth Language
Authors: Ines Rehbein, Sören Schalowski
Published: 2013-07-01, J. Lang. Technol. Comput. Linguistics, DOI: 10.21248/jlcl.28.2013.173

Abstract: The Stuttgart-Tübingen Tag Set (STTS) (Schiller et al., 1995) has long been established as a quasi-standard for part-of-speech (POS) tagging of German. It has been used, with minor modifications, for the annotation of three German newspaper treebanks: the NEGRA treebank (Skut et al., 1997), the TiGer treebank (Brants et al., 2002) and the TüBa-D/Z (Telljohann et al., 2004). One major drawback, however, is the lack of tags for the analysis of language phenomena from domains other than newspaper text. A case in point is spoken language, which displays a wide range of phenomena that do not (or only very rarely) occur in newspaper text.
Title: Über den Einfluss von Part-of-Speech-Tags auf Parsing-Ergebnisse (On the influence of part-of-speech tags on parsing results)
Authors: Sandra Kübler, Wolfgang Maier
Published: 2013-07-01, J. Lang. Technol. Comput. Linguistics, DOI: 10.21248/jlcl.28.2013.167

Abstract (translated from German): For a long time, research in data-driven statistical constituency parsing concentrated on developing parsing models for English, more precisely for the Penn Treebank (Marcus et al., 1993). One reason why such models do not readily generalize to other languages is the comparatively weak morphology of English: problems that arise when parsing a morphologically rich language such as Arabic or German do not arise for English. Especially in recent years, research on parsing problems related to complex morphology has attracted increased interest (Kübler and Penn, 2008; Seddah et al., 2010, 2011; Apidianaki et al., 2012). In a treebank, words are generally annotated with information about the part of speech (POS) and the morphological properties of a word. Where, if at all, the line between part-of-speech and morphological information is drawn, and how detailed the annotation is, depends on the individual language and the annotation scheme. Some treebanks have no separate morphological annotation (e.g. the Penn Treebank); in others, part-of-speech and morphology tagsets are separate (e.g. in the German treebanks TiGer (Brants et al., 2002) and NeGra (Skut et al., 1997)); and in yet others there is only a single tagset containing both POS and morphological information (e.g. in the Szeged Treebank (Csendes et al., 2005)). The number of distinct tags for languages with complex morphology can run into the thousands, e.g. for Czech (Hajič et al., 2000), whereas a few tags suffice to model the parts of speech of languages with little to no morphology, e.g. 33 tags for the Penn Chinese Treebank (Xia, 2000). For simplicity, we include all of these annotation types when we speak of part-of-speech annotation from here on. Part-of-speech tags play a key role in parsing as the interface between the lexical level and the actual syntax tree: during parsing, the constituency tree proper is built not directly over the words but over the part-of-speech annotation. A part-of-speech tag can be regarded as an equivalence class of words with similar distributional characteristics; it abstracts over individual words and thus limits the number of parameters for which probabilities must be learned. In lexicalized parsers, the actual words also enter the probability model. Obviously, the part-of-speech annotation has a direct influence on the quality of the parse tree. Not only the quality of the tagger plays a role here, but also the granularity of the tagset itself. A compromise must be
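The granularity trade-off discussed in the abstract, i.e. many informative fine-grained tags versus few easily estimated coarse ones, can be illustrated by projecting morphologically enriched tags onto their POS core. The dotted tag format and the mapping below are invented for the example; real schemes such as STTS with a separate morphology layer encode this differently.

```python
# Hypothetical fine-grained tags in a "POS.morph-features" format
# (invented notation, loosely STTS-flavoured), mapped to coarse POS tags.
FINE_TO_COARSE = {
    "NN.Nom.Sg.Masc": "NN",
    "NN.Acc.Pl.Fem": "NN",
    "VVFIN.3.Sg.Pres": "VVFIN",
    "ADJA.Pos.Nom.Sg": "ADJA",
}

def coarsen(tags):
    """Project fine-grained morphological tags onto their POS core.

    Unseen tags fall back to the substring before the first dot, so the
    projection stays total. Coarsening shrinks the tagset (fewer model
    parameters) at the cost of discarding morphological distinctions.
    """
    return [FINE_TO_COARSE.get(t, t.split(".")[0]) for t in tags]
```

Counting the distinct tags before and after such a projection makes the parameter-reduction effect concrete.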
Pub Date : 2013-07-01, DOI: 10.21248/jlcl.28.2013.175
Chris Biemann, Felix Bildhauer, S. Evert, Dirk Goldhahn, U. Quasthoff, R. Schäfer, Johannes Simon, Leonard Swiezinski, Torsten Zesch
In this article, we give an overview of the steps necessary to construct high-quality corpora from web texts. We first focus on web crawling and the pros and cons of existing crawling strategies. Then, we describe how the crawled data can be linguistically pre-processed in a parallelized way that allows the processing of web-scale input data. As we are working with web data, controlling the quality of the resulting corpus is an important issue, which we address by showing how corpus statistics and a linguistic evaluation can be used to assess the quality of corpora. Finally, we show how the availability of extremely large, high-quality corpora opens up new directions for research in various fields of linguistics, computational linguistics, and natural language processing.
{"title":"Scalable Construction of High-Quality Web Corpora","authors":"Chris Biemann, Felix Bildhauer, S. Evert, Dirk Goldhahn, U. Quasthoff, R. Schäfer, Johannes Simon, Leonard Swiezinski, Torsten Zesch","doi":"10.21248/jlcl.28.2013.175","DOIUrl":"https://doi.org/10.21248/jlcl.28.2013.175","url":null,"abstract":"In this article, we give an overview about the necessary steps to construct high-quality corpora from web texts. We first focus on web crawling and the pros and cons of the existing crawling strategies. Then, we describe how the crawled data can be linguistically pre-processed in a parallelized way that allows the processing of web-scale input data. As we are working with web data, controlling the quality of the resulting corpus is an important issue, which we address by showing how corpus statistics and a linguistic evaluation can be used to assess the quality of corpora. Finally, we show how the availability of extremely large, high-quality corpora opens up new directions for research in various fields of linguistics, computational linguistics, and natural language processing.","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133428730","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2013-07-01, DOI: 10.21248/jlcl.28.2013.168
S. Clematide
The identification of case on case-bearing, declinable words (pronouns, articles, nouns, adjectives) is a crucial requirement for the language processing of inflecting languages such as German. Besides basic syntactic functions governed by the verb (subjects in the nominative; objects in the accusative, dative, or genitive) and nominal modifiers in the genitive, prepositions determine, i.e. govern, case through their government properties. In the following example, all prepositions and all case-bearing words are marked with the corresponding case tags:
{"title":"Wozu Kasusrektion auszeichnen bei Präpositionen?","authors":"S. Clematide","doi":"10.21248/jlcl.28.2013.168","DOIUrl":"https://doi.org/10.21248/jlcl.28.2013.168","url":null,"abstract":"Die Identifizierung von Kasus bei kasustragenden, deklinierbaren Wörtern (Pronomen, Artikel, Nomen, Adjektive) ist eine entscheidende Anforderung an die Sprachverarbeitung für flektierende Sprachen wie Deutsch. Neben grundlegenden syntaktischen Funktionen (Subjekte im Nominativ, Objekte im Akkusativ, Dativ oder Genitiv), welche vom Verb regiert werden und nominalen Modifikatoren im Genitiv sind Präpositionen mit ihren Rektionseigenschaften kasusbestimmend bzw. kasusregierend. Im folgenden Beispiel sind alle Präpositionen und alle kasustragenden Wörter mit entsprechenden Kasus-Tags markiert1:","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124405886","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}