
Latest publications in J. Lang. Technol. Comput. Linguistics

Document-level school lesson quality classification based on German transcripts
Pub Date: 2015-07-01 DOI: 10.21248/jlcl.30.2015.197
Lucie Flekova, Tahir Sousa, Margot Mieskes, Iryna Gurevych
Analyzing large bodies of audiovisual information with respect to discourse-pragmatic categories is a time-consuming, manual activity, yet of growing importance in a wide variety of domains. Given the transcription of the audiovisual recordings, we propose to model the task of assigning discourse-pragmatic categories as a supervised machine learning task. By analyzing the effects of a wide variety of feature classes, we can trace the discourse-pragmatic ratings back to low-level language phenomena and better understand their dependency. The major contribution of this article is thus a rich feature set for analyzing the relationship between the language and the discourse-pragmatic categories assigned to an analyzed audiovisual unit. As one particular application of our methodology, we focus on modelling the quality of lessons according to a set of discourse-pragmatic dimensions. We examine multiple lesson quality dimensions relevant for educational researchers, e.g. to which extent teachers provide objective feedback, encourage cooperation and pursue students' thinking pathways. Using transcripts of real classroom interactions recorded in Germany and Switzerland, we identify a wide range of lexical, stylistic and discourse-pragmatic phenomena that affect the perception of lesson quality, and we interpret our findings together with educational experts. Our results show that features focusing on discourse and cognitive processes are especially beneficial for this novel classification task, and that the task has high potential for automated assistance.
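As a rough illustration of the kind of supervised setup the abstract describes — document-level classification of lesson transcripts over lexical features — here is a minimal sketch. The mini-corpus, labels and feature choices are invented stand-ins; the paper's actual feature set is far richer (stylistic, discourse and cognitive-process features).

```python
# Minimal sketch: document-level lesson quality classification.
# Data and features are toy stand-ins, not the authors' corpus or feature set.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical mini-corpus: one transcript per lesson, rated high/low quality.
transcripts = [
    "teacher gives concrete feedback and asks why the student chose this step",
    "teacher reads aloud while students copy the text from the board in silence",
]
labels = ["high", "low"]

# Word unigrams/bigrams approximate the lexical and stylistic feature classes.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(transcripts, labels)
print(model.predict(["students explain their reasoning and compare solutions"]))
```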
Citations: 2
A relational database model and prototype for storing diverse discrete linguistic data
Pub Date: 2015-07-01 DOI: 10.21248/jlcl.30.2015.194
Alexander Magidow
This article describes a model for storing multiple forms of linguistic data within a relational database, as developed and tested through a prototype database for storing data from Arabic dialects. A challenge that typically confronts linguistic documentation projects is the need for a flexible data model that can be adapted to the growing needs of a project (Dimitriadis, 2006). Contributors to linguistic databases typically cannot predict exactly which attributes of their data they will need to store, and therefore the initial design of the database may need to change over time. Many projects take advantage of the flexibility of XML and RDF to allow for continuing revisions to the data model. For some projects, there may be a compelling need to use a relational database system, though some approaches to relational database design may not be flexible enough to allow for adaptation over time (Dimitriadis, 2006). The goal of this article is to describe a relational database model which can adapt easily to storing new data types as a project evolves. It both describes a general data model and shows its implementation within a working project. The model is primarily intended for storing discrete linguistic elements (phonemes, morphemes including general lexical data, sentences) as opposed to text corpora, and would be expected to store data on the order of thousands to hundreds of thousands of rows. The relational model described in this paper is centered around the linguistic datum, encoded as a string of characters, associated in a many-to-many relationship with 'tags', and in many-to-many named relationships with other datums. For this reason, the model will be referred to as the 'tag-and-relationship' model. The combination of tags and relationships allows the database to store a wide variety of linguistic data. This data model was developed in tandem with a project to encode linguistic data from Arabic dialects (the "Database of Arabic Dialects", DAD). Arabic is an extremely diverse language group, with dialects stretching from Mauritania to Afghanistan …
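A minimal sketch of the tag-and-relationship idea in SQL (via Python's sqlite3): one table of datums, one of tags, a many-to-many link table between them, and a named many-to-many relation between datums. Table and column names, and the sample rows, are our own illustrations, not the actual DAD schema.

```python
# Sketch of the 'tag-and-relationship' model: datums, tags, and two
# many-to-many link tables. Schema names are illustrative assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE datum (id INTEGER PRIMARY KEY, form TEXT NOT NULL);
CREATE TABLE tag   (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE datum_tag (                      -- many-to-many: datum <-> tag
    datum_id INTEGER REFERENCES datum(id),
    tag_id   INTEGER REFERENCES tag(id),
    PRIMARY KEY (datum_id, tag_id)
);
CREATE TABLE datum_relation (                 -- named datum <-> datum links
    source_id INTEGER REFERENCES datum(id),
    target_id INTEGER REFERENCES datum(id),
    relation  TEXT NOT NULL,                  -- e.g. 'plural_of'
    PRIMARY KEY (source_id, target_id, relation)
);
""")

# New attributes become new tags, so the schema never needs altering.
conn.execute("INSERT INTO datum VALUES (1, 'kitab'), (2, 'kutub')")
conn.execute("INSERT INTO tag VALUES (1, 'noun'), (2, 'dialect:Cairo')")
conn.execute("INSERT INTO datum_tag VALUES (1, 1), (1, 2)")
conn.execute("INSERT INTO datum_relation VALUES (2, 1, 'plural_of')")
print(conn.execute("""
    SELECT d.form, t.name FROM datum d
    JOIN datum_tag dt ON dt.datum_id = d.id
    JOIN tag t ON t.id = dt.tag_id
""").fetchall())
```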
Citations: 1
Discourse Segmentation of German Texts
Pub Date: 2015-07-01 DOI: 10.21248/jlcl.30.2015.196
Wladimir Sidorenko, A. Peldszus, Manfred Stede
This paper addresses the problem of segmenting German texts into minimal discourse units, as they are needed, for example, in RST-based discourse parsing. We discuss relevant variants of the problem, introduce the design of our annotation guidelines, and provide the results of an extensive inter-annotator agreement study of the corpus. Afterwards, we report on our experiments with three automatic classifiers that rely on the output of state-of-the-art parsers and use different amounts and kinds of syntactic knowledge: constituent parsing versus dependency parsing; tree-structure classification versus sequence labeling. Finally, we compare our approaches with recent discourse segmentation methods proposed for English.
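As a toy version of the sequence-labeling variant, the sketch below classifies each token as beginning a new discourse unit (B) or not (O). Features, data and tags are invented for illustration; the paper's classifiers draw on constituent and dependency parser output.

```python
# Toy discourse segmentation as sequence labeling: tag each token B (begins
# a minimal discourse unit) or O. Features/data are illustrative assumptions.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def token_features(tokens, i):
    return {
        "word": tokens[i].lower(),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "comma_before": tokens[i - 1] == "," if i > 0 else False,
    }

sent = "Er blieb zu Hause , weil es regnete".split()
tags = ["B", "O", "O", "O", "O", "B", "O", "O"]  # 'weil' opens a new unit

X = [token_features(sent, i) for i in range(len(sent))]
clf = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(X, tags)
print(clf.predict(X))
```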
Citations: 16
Sentiment Classification at Discourse Segment Level: Experiments on multi-domain Arabic corpus
Pub Date: 2015-07-01 DOI: 10.21248/jlcl.30.2015.193
Amine Bayoudhi, Hatem Ghorbel, Houssem Koubaa, Lamia Hadrich Belguith
Sentiment classification aims to determine whether the semantic orientation of a text is positive, negative or neutral. It can be tackled at several levels of granularity: expression or phrase level, sentence level, and document level. In the scope of this research, we are interested in sentence- and sub-sentential-level classification, which can provide very useful signals for information retrieval and extraction applications, Question Answering systems and summarization tasks. In the context of our work, we address the problem of Arabic sentiment classification at the sub-sentential level by (i) building a high-coverage sentiment lexicon with a semi-automatic approach; (ii) creating a large multi-domain annotated sentiment corpus segmented into discourse segments in order to evaluate our sentiment approach; and (iii) applying a lexicon-based approach with an aggregation model that takes into account advanced linguistic phenomena such as negation and intensification. The results we obtained are good and close to state-of-the-art results for English.
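A minimal sketch of a lexicon-based scorer with an aggregation step for negation and intensification, in the spirit of point (iii). The lexicon entries, weights and one-token scope are illustrative assumptions, and English words stand in for the Arabic data for readability.

```python
# Minimal lexicon-based polarity scorer with negation/intensification
# handling. Lexicon, weights and window size are illustrative assumptions.
LEXICON = {"good": 1.0, "excellent": 2.0, "bad": -1.0, "awful": -2.0}
NEGATORS = {"not", "never"}
INTENSIFIERS = {"very": 1.5, "extremely": 2.0}

def segment_polarity(tokens):
    score, i = 0.0, 0
    while i < len(tokens):
        weight, tok = 1.0, tokens[i]
        if tok in NEGATORS and i + 1 < len(tokens):        # flip next word
            weight, i, tok = -1.0, i + 1, tokens[i + 1]
        elif tok in INTENSIFIERS and i + 1 < len(tokens):  # boost next word
            weight, i, tok = INTENSIFIERS[tok], i + 1, tokens[i + 1]
        score += weight * LEXICON.get(tok, 0.0)
        i += 1
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(segment_polarity("the plot was not good".split()))          # negative
print(segment_polarity("an extremely good performance".split()))  # positive
```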
Citations: 9
Building Linguistic Corpora from Wikipedia Articles and Discussions
Pub Date: 2014-07-01 DOI: 10.21248/jlcl.29.2014.189
Eliza Margaretha, H. Lüngen
Wikipedia is a valuable resource, useful as a linguistic corpus or a dataset for many kinds of research. We built corpora from Wikipedia articles and talk pages in the I5 format, a TEI customisation used in the German Reference Corpus (Deutsches Referenzkorpus, DeReKo). Our approach is a two-stage conversion combining parsing with the Sweble parser and transformation with XSLT stylesheets. The conversion approach successfully generates rich and valid corpora regardless of language. We also introduce a method to segment user contributions on talk pages into postings.
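The second conversion stage can be pictured with a tiny XSLT transformation (here via lxml in Python). The input document and stylesheet are simplified stand-ins: the real pipeline first parses wikitext with the Sweble parser and targets the full I5/TEI schema.

```python
# Sketch of the XSLT stage: transform a parsed intermediate XML document
# into a (much simplified) TEI-like target. Input and stylesheet are toys.
from lxml import etree

parsed = etree.XML(
    "<page><title>Beispiel</title><para>Erster Absatz.</para></page>"
)
stylesheet = etree.XML("""
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="page">
    <text><body><xsl:apply-templates/></body></text>
  </xsl:template>
  <xsl:template match="title"><head><xsl:value-of select="."/></head></xsl:template>
  <xsl:template match="para"><p><xsl:value-of select="."/></p></xsl:template>
</xsl:stylesheet>
""")
transform = etree.XSLT(stylesheet)
print(etree.tostring(transform(parsed), pretty_print=True).decode())
```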
Citations: 36
IGGSA-STEPS: Shared Task on Source and Target Extraction from Political Speeches
Pub Date: 2014-07-01 DOI: 10.21248/jlcl.29.2014.182
Josef Ruppenhofer, Julia Maria Struß, J. Sonntag, Stefan Gindl
Accurate opinion mining requires the exact identification of the source and target of an opinion. To evaluate diverse tools, the research community relies on the existence of a gold-standard corpus covering this need. Since such a corpus is currently not available for German, the Interest Group on German Sentiment Analysis decided to create such a resource and make it available to the research community in the context of a shared task. In this paper, we describe the selection of textual sources, the development of annotation guidelines, and first evaluation results from the creation of a gold-standard corpus for the German language.
Citations: 8
Challenges and experiences in collecting a chat corpus
Pub Date: 2014-07-01 DOI: 10.21248/jlcl.29.2014.190
W. Spooren, T. V. Charldorp
Present-day access to a wealth of electronically available linguistic data creates enormous opportunities for cutting-edge research questions and analyses. Computer-mediated communication (CMC) data are especially interesting, for example because the multimodal character of new media puts our ideas about discourse issues like coherence to the test. At the same time, CMC data are ephemeral because of rapidly changing technology. That is why we urgently need to collect CMC discourse data before the technology becomes obsolete. This paper describes a number of challenges we encountered when collecting a chat corpus with data from secondary school children in Amsterdam. These challenges are various in nature: logistic, ethical and technological.
Citations: 2
Domain Adaptation for Opinion Mining: A Study of Multipolarity Words
Pub Date: 2014-07-01 DOI: 10.21248/jlcl.29.2014.181
M. Marchand, Romaric Besançon, O. Mesnard, Anne Vilnat
Expression of opinion depends on the domain. For instance, some words, called here multi-polarity words, have different polarities across domains. Therefore, a classifier trained on one domain and tested on another will not perform well without adaptation. This article presents a study of the influence of these multi-polarity words on domain adaptation for automatic opinion classification. We also suggest an exploratory method for detecting them without using any label in the target domain. We show as well how these multi-polarity words can improve opinion classification in an open-domain corpus.
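One simple way to picture multi-polarity words: compare a word's association with positive vs. negative documents in two domains and flag words whose association flips sign. Note this toy heuristic uses labels in both domains, unlike the paper's exploratory method, which avoids target-domain labels; data and function names are invented.

```python
# Toy detector for multi-polarity candidates: a word whose polarity
# association flips sign between two labeled domains gets flagged.
from collections import Counter

def polarity_assoc(docs):
    """docs: list of (tokens, label) pairs with label in {+1, -1}."""
    assoc = Counter()
    for tokens, label in docs:
        for w in set(tokens):
            assoc[w] += label
    return assoc

books = [(("an", "unpredictable", "plot"), +1), (("a", "dull", "plot"), -1)]
appliances = [(("unpredictable", "behaviour"), -1), (("reliable", "motor"), +1)]

a, b = polarity_assoc(books), polarity_assoc(appliances)
flips = [w for w in a.keys() & b.keys() if a[w] * b[w] < 0]
print(flips)  # ['unpredictable']: positive for books, negative for appliances
```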
Citations: 3
Unsupervised feature learning for sentiment classification of short documents
Pub Date: 2014-07-01 DOI: 10.21248/jlcl.29.2014.180
S. Albertini, Alessandro Zamberletti, I. Gallo
The rapid growth of Web information has led to an increasing amount of user-generated content, such as customer reviews of products, forum posts and blogs. In this paper we face the task of assigning a sentiment polarity to user-generated short documents to determine whether each of them communicates a positive or negative judgment about a subject. The method we propose exploits a Growing Hierarchical Self-Organizing Map to obtain a sparse encoding of user-generated content. The encoded documents are subsequently given as input to a Support Vector Machine classifier that assigns them a polarity label. Unlike other works on opinion mining, our model does not use a priori hypotheses involving special words, phrases or language constructs typical of certain domains. Using a dataset composed of customer reviews of products, the experimental results we obtain are close to those achieved by other recent works.
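A rough analogue of the pipeline shape: learn an unsupervised codebook over documents, re-encode each document sparsely against it, and train an SVM on the codes. KMeans is substituted here for the Growing Hierarchical Self-Organizing Map, so this mirrors only the architecture, not the actual GHSOM encoding; the documents and labels are invented.

```python
# Unsupervised codebook -> sparse re-encoding -> SVM, with KMeans standing
# in for the GHSOM. Data, cluster count and kernel are toy choices.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

docs = ["great phone, love it", "battery died fast, terrible",
        "works great", "terrible screen", "love the battery", "died on day one"]
labels = [1, 0, 1, 0, 1, 0]

tfidf = TfidfVectorizer().fit_transform(docs)
codebook = KMeans(n_clusters=3, n_init=10, random_state=0).fit(tfidf)

# Sparse code: one-hot assignment of each document to its nearest prototype.
codes = np.eye(3)[codebook.labels_]
clf = SVC(kernel="linear").fit(codes, labels)
print(clf.predict(codes))
```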
Citations: 6
Using Brain Data for Sentiment Analysis
Pub Date: 2014-07-01 DOI: 10.21248/jlcl.29.2014.185
Yuqiao Gu, Fabio Celli, J. Steinberger, A. Anderson, Massimo Poesio, C. Strapparava, B. Murphy
We present the results of exploratory experiments using lexical valence extracted from brain activity via electroencephalography (EEG) for sentiment analysis. We selected 78 English words (36 for training and 42 for testing), presented as stimuli to 3 native English speakers. EEG signals were recorded from the subjects while they performed a mental imaging task for each word stimulus. Wavelet decomposition was employed to extract EEG features from the time-frequency domain. After univariate ANOVA feature selection, the extracted features were used as inputs to a sparse multinomial logistic regression (SMLR) classifier for valence classification. After mapping EEG signals to sentiment valences, we exploited the lexical polarity extracted from brain data to predict the valence of 12 sentences taken from the SemEval-2007 shared task, and compared it against existing lexical resources.
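On synthetic signals, the described pipeline might look like the sketch below: wavelet-decompose each epoch, keep the strongest features by one-way ANOVA, and fit an L1-regularised ("sparse") logistic regression. Epoch length, wavelet, decomposition level and k are arbitrary placeholders, not the paper's settings.

```python
# Sketch: wavelet features -> ANOVA selection -> sparse logistic regression,
# on synthetic stand-in signals rather than real EEG recordings.
import numpy as np
import pywt
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
epochs = rng.standard_normal((40, 256))      # 40 fake single-word epochs
valence = rng.integers(0, 2, size=40)        # 0 = negative, 1 = positive

def wavelet_features(signal):
    # Concatenate all coefficients of a 4-level discrete wavelet transform.
    return np.concatenate(pywt.wavedec(signal, "db4", level=4))

X = np.array([wavelet_features(e) for e in epochs])
clf = make_pipeline(
    SelectKBest(f_classif, k=20),
    LogisticRegression(penalty="l1", solver="saga", max_iter=5000),
)
clf.fit(X, valence)
print(clf.score(X, valence))
```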
Citations: 7