Title: Subjectivity Lexicon for Czech: Implementation and Improvements
Authors: Katerina Veselovská, Jan Hajic, J. Šindlerová
Published: 2014-07-01, J. Lang. Technol. Comput. Linguistics, DOI: 10.21248/jlcl.29.2014.183

Abstract: The aim of this paper is to introduce the Czech subjectivity lexicon, a new lexical resource for sentiment analysis in Czech. We describe the particular stages of the manual refinement of the lexicon and demonstrate its use in state-of-the-art polarity classifiers, namely the Maximum Entropy classifier. We test the success rate of the system enriched with the dictionary on different data sets, compare the results, and suggest further improvements to the lexicon-based classification system.
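The lexicon-driven setup the abstract describes can be sketched in miniature. This is a toy illustration, not the authors' system: the Czech lexicon entries, feature names, and the majority-vote decision rule are all invented for the example; in the paper's setting, a Maximum Entropy classifier would learn weights for features like these rather than apply a fixed rule.

```python
# Hypothetical miniature subjectivity lexicon: word -> prior polarity.
# Entries are invented examples, not taken from the Czech lexicon itself.
LEXICON = {"dobrý": 1, "skvělý": 1, "špatný": -1, "hrozný": -1}

def lexicon_features(tokens):
    """Count positive and negative lexicon hits in a tokenized text.

    In a MaxEnt classifier these counts would be two features among many
    (alongside e.g. n-grams), each with a learned weight.
    """
    pos = sum(1 for t in tokens if LEXICON.get(t.lower(), 0) > 0)
    neg = sum(1 for t in tokens if LEXICON.get(t.lower(), 0) < 0)
    return {"pos_hits": pos, "neg_hits": neg}

def baseline_polarity(tokens):
    """Majority-vote baseline over lexicon hits (a stand-in for the
    trained classifier, for illustration only)."""
    f = lexicon_features(tokens)
    if f["pos_hits"] > f["neg_hits"]:
        return "positive"
    if f["neg_hits"] > f["pos_hits"]:
        return "negative"
    return "neutral"
```

A rule like this is a common baseline against which lexicon-enriched statistical classifiers are compared.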
Title: Variability in Dutch Tweets. An estimate of the proportion of deviant word tokens
Authors: H. V. Halteren, N. Oostdijk
Published: 2014-07-01, J. Lang. Technol. Comput. Linguistics, DOI: 10.21248/jlcl.29.2014.191

Abstract: In this paper, we attempt to estimate what proportion of the word tokens in Dutch tweets are not covered by standard resources and can therefore be expected to cause problems for standard NLP applications. We fully annotated and analysed a small pilot corpus. We also used the corpus to calibrate automatic estimation procedures for the proportions of non-word tokens and of out-of-vocabulary words, after which we applied these procedures to about 2 billion Dutch tweets. We find that the proportion of possibly problematic tokens is so high (e.g. an estimate of 15% of the words being problematic in the full tweet collection, and the annotated sample of death-threat-related tweets showing problematic words in three out of four tweets) that any NLP application designed for standard Dutch can be expected to be seriously hampered in its processing. We suggest a few approaches to alleviate the problem, but none of them will solve it completely.
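The abstract's distinction between non-word tokens (URLs, mentions, hashtags) and out-of-vocabulary words can be sketched as a coverage check against a standard lexicon. This is a hedged illustration, not the paper's calibrated estimation procedure: the tiny vocabulary, the regular expression, and the punctuation stripping are all simplifying assumptions.

```python
import re

# Stand-in for a real standard-Dutch lexicon (a real one has ~10^5+ entries).
STANDARD_VOCAB = {"ik", "heb", "een", "mooi", "huis"}

def tweet_coverage(text):
    """Classify the tokens of one tweet into non-words (URLs, @mentions,
    #hashtags) and out-of-vocabulary words; everything else is covered.

    The non-word pattern is a deliberately crude approximation.
    """
    tokens = text.split()
    nonword = [t for t in tokens if re.match(r"^(https?://|[@#])", t)]
    words = [t for t in tokens if t not in nonword]
    oov = [w for w in words if w.strip(".,!?").lower() not in STANDARD_VOCAB]
    return {"tokens": len(tokens), "nonword": len(nonword), "oov": len(oov)}
```

Aggregating such per-tweet counts over a large collection gives the kind of proportion estimate the paper calibrates against its hand-annotated pilot corpus.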
Title: Challenges of building a CMC corpus for analyzing writer's style by age: The DiDi project
Authors: A. Glaznieks, Egon W. Stemle
Published: 2014-07-01, J. Lang. Technol. Comput. Linguistics, DOI: 10.21248/jlcl.29.2014.188

Abstract: This paper introduces the project DiDi, in which we collect and analyze German data of computer-mediated communication (CMC) written by internet users from the Italian province of Bolzano – South Tyrol. The project focuses on quasi-public and private messages posted on Facebook, and analyses how L1 German speakers in South Tyrol use different varieties of German (e.g. South Tyrolean dialect vs. Standard German) and other languages (esp. Italian) to communicate on social network sites. A particular interest of the study is the writers' age. We assume that users of different age groups can be distinguished by their linguistic behavior. Our comprehension of age is based on two conceptions: a person's regular numerical age and her/his digital age, i.e. the number of years a person has been actively involved in using new media. The paper describes the project as well as its diverse challenges and problems of data collection and corpus building. Finally, we also discuss possible ways in which these challenges can be met.

1 Language in computer-mediated communication

There is a wealth of studies in the corpus-linguistic literature on the particularities of language used in computer-mediated communication (CMC) (e.g., for German, Bader 2002, Demuth and Schulz 2010, Dürscheid et al. 2010, Günthner and Schmidt 2002, Harvelid 2007, Kessler 2008, Kleinberger Günther and Spiegel 2006, Siebenhaar 2006, Siever 2005, Salomonsson 2011). In particular, the use of "netspeak" phenomena (Crystal 2001) such as emoticons, acronyms and abbreviations, interaction words, and iteration of letters has attracted attention. The studies describe different functions of such phenomena within CMC. Features transferred from spoken language, such as discourse particles and vernacular and dialectal expressions, are frequently mentioned characteristics of CMC. They serve to convey the informality of a given message, comment, or status post. Writers often use emoticons, interaction words (e.g. *grin*), abbreviations (e.g. lol), and spelling changes such as the iteration of letters (e.g. coooooll) to compensate for the absence of facial expressions, gestures and other kinesic features, and prosody. Many emoticons, interaction words, and abbreviations are "verbal glosses" for performed actions and aspects of specific situations. In addition, there are particularities in spelling that people use without the aim of representing features of spoken language and that deviate from the standard variety. To cover such phenomena (e.g. n8 for 'night'), we follow Androutsopoulos (2007; 2011) and use the term "graphostylistics". Finally, all forms of shortening (e.g. lol, n8, and thx for thanks) are often used for reasons of economy, to carry out speedy conversations in chats and instant messages. The use of shortenings can also be motivated by character restrictions of the services used. Differences between the use of language in CMC and in traditional written genres have often been described with reference to the model of Koch/Oesterreicher (1985; 2008).
Title: The notion of importance in academic writing: detection, linguistic properties and targets
Authors: Stefania Degaetano-Ortlieb, Hannah Kermes, E. Teich
Published: 2014-07-01, J. Lang. Technol. Comput. Linguistics, DOI: 10.21248/jlcl.29.2014.184

Abstract: We present a semi-automatic approach to studying expressions of evaluation in academic writing as well as the targets evaluated. The aim is to uncover the linguistic properties of evaluative expressions used in this genre, i.e. to investigate which lexico-grammatical patterns are used to attribute an evaluation to a target. The approach encompasses pattern detection and the semi-automatic annotation of the patterns in the SciTex Corpus (Teich and Fankhauser, 2010; Degaetano-Ortlieb et al., 2013). We exemplify the procedures by investigating the notion of importance expressed in academic writing. By extracting distributional information provided by the annotation, we analyze how this notion might differ across academic disciplines and sections of research articles.
Title: The CoMeRe corpus for French: structuring and annotating heterogeneous CMC genres
Authors: T. Chanier, Céline Poudat, Benoît Sagot, G. Antoniadis, Ciara R. Wigham, Linda Hriba, Julien Longhi, Djamé Seddah
Published: 2014-01-11, J. Lang. Technol. Comput. Linguistics, DOI: 10.21248/jlcl.29.2014.187

Abstract: The CoMeRe project aims to build a kernel corpus of different Computer-Mediated Communication (CMC) genres with interactions in French as the main language, by assembling interactions stemming from networks such as the Internet or telecommunication, covering mono- and multimodal as well as synchronous and asynchronous communication. Corpora are assembled using a standard, the TEI (Text Encoding Initiative) format. This implies extending, through a European endeavor, the TEI model of text in order to encompass the richest and most complex CMC genres. This paper presents the Interaction Space model. We explain how this model has been encoded within the TEI corpus header and body. The model is then instantiated through the first four corpora we have processed: three corpora where interactions occurred in single-modality environments (text chat, or SMS systems) and a fourth corpus where text chat, email and forum modalities were used simultaneously. The CoMeRe project has two main research perspectives: Discourse Analysis, only alluded to in this paper, and the linguistic study of idiolects occurring in different CMC genres. As NLP algorithms are an indispensable prerequisite for such research, we present our motivations for applying an automatic annotation process to the CoMeRe corpora. Our wish to guarantee generic annotations meant we did not consider any processing beyond morphosyntactic labelling, but prioritized the automatic annotation of any freely variant elements within the corpora. We then turn to decisions made concerning which annotations to make for which units, and describe the processing pipeline for adding these. All CoMeRe corpora are verified through a staged quality-control process, designed to allow corpora to move from one project phase to the next. Public release of the CoMeRe corpora is a short-term goal: the corpora will be integrated into the forthcoming French National Reference Corpus and disseminated through the national linguistic infrastructure ORTOLANG. We therefore highlight issues and decisions made concerning the OpenData perspective.
Title: Optimierung des Stuttgart-Tübingen-Tagset für die linguistische Annotation von Korpora zur internetbasierten Kommunikation: Phänomene, Herausforderungen, Erweiterungsvorschläge (Optimizing the Stuttgart-Tübingen Tagset for the linguistic annotation of corpora of internet-based communication: phenomena, challenges, proposed extensions)
Authors: Thomas Bartz, Michael Beißwenger, Angelika Storrer
Published: 2013-07-01, J. Lang. Technol. Comput. Linguistics, DOI: 10.21248/jlcl.28.2013.172

Abstract (fragment, translated from German): "[…] empirically shown (cf. e.g. […]) […] constitute, alongside representation standards, a further […]"
Title: STTS goes Kiez - Experiments on Annotating and Tagging Urban Youth Language
Authors: Ines Rehbein, Sören Schalowski
Published: 2013-07-01, J. Lang. Technol. Comput. Linguistics, DOI: 10.21248/jlcl.28.2013.173

Abstract: The Stuttgart-Tübingen Tag Set (STTS) (Schiller et al., 1995) has long been established as a quasi-standard for part-of-speech (POS) tagging of German. It has been used, with minor modifications, for the annotation of three German newspaper treebanks: the NEGRA treebank (Skut et al., 1997), the TiGer treebank (Brants et al., 2002) and the TüBa-D/Z (Telljohann et al., 2004). One major drawback, however, is the lack of tags for the analysis of language phenomena from domains other than newspaper text. A case in point is spoken language, which displays a wide range of phenomena that do not (or only very rarely) occur in newspaper text.
Title: Über den Einfluss von Part-of-Speech-Tags auf Parsing-Ergebnisse (On the influence of part-of-speech tags on parsing results)
Authors: Sandra Kübler, Wolfgang Maier
Published: 2013-07-01, J. Lang. Technol. Comput. Linguistics, DOI: 10.21248/jlcl.28.2013.167

Abstract (translated from German): For a long time, research in data-driven statistical constituency parsing concentrated on developing parsing models for English, more precisely for the Penn Treebank (Marcus et al., 1993). One reason why such models do not readily generalize to other languages is the comparatively weak morphology of English: problems that arise when parsing a morphologically rich language such as Arabic or German do not arise for English. Especially in recent years, research on parsing problems related to complex morphology has attracted increased interest (Kübler and Penn, 2008; Seddah et al., 2010, 2011; Apidianaki et al., 2012). In a treebank, words are generally annotated with information about the part of speech (POS) and the morphological properties of a word. Where, if at all, the line between part-of-speech and morphological information is drawn, and how detailed the annotation is, depends on the individual language and the annotation scheme. Some treebanks have no separate morphological annotation (e.g. the Penn Treebank); in others, part-of-speech and morphology tagsets are separate (e.g. in the German treebanks TiGer (Brants et al., 2002) and NeGra (Skut et al., 1997)); and in yet others there is only a single tagset containing both POS and morphological information (e.g. in the Szeged Treebank (Csendes et al., 2005)). The number of distinct tags for languages with complex morphology can run into the thousands, e.g. for Czech (Hajič et al., 2000), whereas a few tags suffice to model the parts of speech of languages with little to no morphology, e.g. 33 tags for the Penn Chinese Treebank (Xia, 2000). For simplicity, we include all of these annotation types when we speak of part-of-speech annotation from here on. Part-of-speech tags play a key role in parsing as the interface between the lexical level and the actual syntax tree: during parsing, the constituency tree proper is built not directly over the words but over the part-of-speech annotation. A part-of-speech tag can be regarded as an equivalence class of words with similar distributional characteristics; it abstracts over individual words and thus limits the number of parameters for which probabilities must be learned. In lexicalized parsers, the actual words also enter the probability model. Obviously, the part-of-speech annotation has a direct influence on the quality of the parse tree. Not only the quality of the tagger plays a role here, but also the granularity of the tagset itself. A compromise must be
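The granularity trade-off discussed in the abstract, i.e. many informative fine-grained tags versus few easily estimated coarse ones, can be illustrated by projecting morphologically enriched tags onto their POS core. The dotted tag format and the mapping below are invented for the example; real schemes such as STTS with a separate morphology layer encode this differently.

```python
# Hypothetical fine-grained tags in a "POS.morph-features" format
# (invented notation, loosely STTS-flavoured), mapped to coarse POS tags.
FINE_TO_COARSE = {
    "NN.Nom.Sg.Masc": "NN",
    "NN.Acc.Pl.Fem": "NN",
    "VVFIN.3.Sg.Pres": "VVFIN",
    "ADJA.Pos.Nom.Sg": "ADJA",
}

def coarsen(tags):
    """Project fine-grained morphological tags onto their POS core.

    Unseen tags fall back to the substring before the first dot, so the
    projection stays total. Coarsening shrinks the tagset (fewer model
    parameters) at the cost of discarding morphological distinctions.
    """
    return [FINE_TO_COARSE.get(t, t.split(".")[0]) for t in tags]
```

Counting the distinct tags before and after such a projection makes the parameter-reduction effect concrete.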
Pub Date : 2013-07-01, DOI: 10.21248/jlcl.28.2013.175
Chris Biemann, Felix Bildhauer, S. Evert, Dirk Goldhahn, U. Quasthoff, R. Schäfer, Johannes Simon, Leonard Swiezinski, Torsten Zesch
In this article, we give an overview of the steps necessary to construct high-quality corpora from web texts. We first focus on web crawling and the pros and cons of existing crawling strategies. Then, we describe how the crawled data can be linguistically pre-processed in a parallelized way that allows the processing of web-scale input data. As we are working with web data, controlling the quality of the resulting corpus is an important issue, which we address by showing how corpus statistics and a linguistic evaluation can be used to assess the quality of corpora. Finally, we show how the availability of extremely large, high-quality corpora opens up new directions for research in various fields of linguistics, computational linguistics, and natural language processing.
{"title":"Scalable Construction of High-Quality Web Corpora","authors":"Chris Biemann, Felix Bildhauer, S. Evert, Dirk Goldhahn, U. Quasthoff, R. Schäfer, Johannes Simon, Leonard Swiezinski, Torsten Zesch","doi":"10.21248/jlcl.28.2013.175","DOIUrl":"https://doi.org/10.21248/jlcl.28.2013.175","url":null,"abstract":"In this article, we give an overview about the necessary steps to construct high-quality corpora from web texts. We first focus on web crawling and the pros and cons of the existing crawling strategies. Then, we describe how the crawled data can be linguistically pre-processed in a parallelized way that allows the processing of web-scale input data. As we are working with web data, controlling the quality of the resulting corpus is an important issue, which we address by showing how corpus statistics and a linguistic evaluation can be used to assess the quality of corpora. Finally, we show how the availability of extremely large, high-quality corpora opens up new directions for research in various fields of linguistics, computational linguistics, and natural language processing.","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133428730","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2013-07-01, DOI: 10.21248/jlcl.28.2013.168
S. Clematide
The identification of case on case-bearing, declinable words (pronouns, articles, nouns, adjectives) is a crucial requirement for the language processing of inflecting languages such as German. Besides basic syntactic functions governed by the verb (subjects in the nominative; objects in the accusative, dative, or genitive) and nominal modifiers in the genitive, prepositions determine, i.e. govern, case through their government properties. In the following example, all prepositions and all case-bearing words are marked with the corresponding case tags:
{"title":"Wozu Kasusrektion auszeichnen bei Präpositionen?","authors":"S. Clematide","doi":"10.21248/jlcl.28.2013.168","DOIUrl":"https://doi.org/10.21248/jlcl.28.2013.168","url":null,"abstract":"Die Identifizierung von Kasus bei kasustragenden, deklinierbaren Wörtern (Pronomen, Artikel, Nomen, Adjektive) ist eine entscheidende Anforderung an die Sprachverarbeitung für flektierende Sprachen wie Deutsch. Neben grundlegenden syntaktischen Funktionen (Subjekte im Nominativ, Objekte im Akkusativ, Dativ oder Genitiv), welche vom Verb regiert werden und nominalen Modifikatoren im Genitiv sind Präpositionen mit ihren Rektionseigenschaften kasusbestimmend bzw. kasusregierend. Im folgenden Beispiel sind alle Präpositionen und alle kasustragenden Wörter mit entsprechenden Kasus-Tags markiert1:","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124405886","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}