
Latest publications in Linguamatica

Solo Queue at ASSIN: Combinando Abordagens Tradicionais e Emergentes (Solo Queue at ASSIN: Combining Traditional and Emerging Approaches)
IF 0.6 Q4 LINGUISTICS Pub Date: 2016-12-31 DOI: 10.21814/LM.8.2.230
N. Hartmann
In this paper we present a proposal to automatically label the similarity between a pair of sentences, together with the results obtained on the ASSIN 2016 sentence-similarity shared task. Our proposal combines a classical bag-of-words feature, the TF-IDF model, with an emergent feature obtained from word embeddings. TF-IDF is used to relate texts that share words, while word embeddings are known to capture the syntax and semantics of a word. Following Mikolov et al. (2013), the sum of embedding vectors can model the meaning of a sentence. Using both features, we capture both the words shared between sentences and their semantics. Since the dataset is labeled with real numbers between 1 and 5, we address the problem with linear regression. Our results are promising: although the embedding feature alone did not outperform our baseline system, combining it with TF-IDF achieved better results than TF-IDF alone. Our system ranked first in the ASSIN 2016 sentence-similarity shared task on Brazilian Portuguese sentences and second on European Portuguese sentences.
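A minimal sketch of the feature combination described above, assuming pre-trained word vectors are available as a dict; the names, the 300-dimension default, and the per-call TF-IDF fitting are simplifications, not the authors' code:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics.pairwise import cosine_similarity

def sentence_vector(sentence, word_vectors, dim=300):
    # Sum of word embeddings as the sentence meaning (Mikolov et al., 2013)
    vecs = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    return np.sum(vecs, axis=0) if vecs else np.zeros(dim)

def pair_features(pairs, word_vectors):
    # One TF-IDF similarity and one embedding similarity per sentence pair
    tfidf = TfidfVectorizer().fit([s for pair in pairs for s in pair])
    feats = []
    for a, b in pairs:
        tfidf_sim = cosine_similarity(tfidf.transform([a]), tfidf.transform([b]))[0, 0]
        emb_sim = cosine_similarity(sentence_vector(a, word_vectors).reshape(1, -1),
                                    sentence_vector(b, word_vectors).reshape(1, -1))[0, 0]
        feats.append([tfidf_sim, emb_sim])
    return np.array(feats)

def train(pairs, scores, word_vectors):
    # Gold labels are real numbers in [1, 5], hence linear regression
    return LinearRegression().fit(pair_features(pairs, word_vectors), scores)
```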
Citations: 22
Perfilado de autor multilingüe en redes sociales a partir de n-gramas de caracteres y de etiquetas gramaticales (Multilingual Author Profiling on Social Networks from Character N-grams and Part-of-Speech N-grams)
IF 0.6 Q4 LINGUISTICS Pub Date: 2016-07-22 DOI: 10.21814/LM.8.1.227
Carlos-Emiliano González-Gallardo, Juan-Manuel Torres-Moreno, Azucena Montes Rendón, Gerardo E Sierra
In this article we present an algorithm that combines the stylistic features represented by character n-grams and part-of-speech (POS) n-grams to classify multilingual social media documents. A context-dependent dynamic normalisation was applied to both groups of n-grams in order to extract as much of the stylistic information encoded in the documents as possible (emoticons, character flooding, use of capital letters, references to users, links to external sites, hashtags, etc.). The algorithm was applied to two different corpora: the tweets of the PAN-CLEF 2015 Author Profiling training corpus and the "Comentarios de la Ciudad de México en el tiempo" (CCDMX) corpus. The results show very high accuracy, close to 90%.
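A toy illustration of the two feature families (not the authors' pipeline: the dynamic normalisation step is omitted, and the POS strings, tagset and labels are invented):

```python
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Toy data: raw documents plus a POS-tagged copy of each one (assumed tagset)
raw_docs = ["jajaja q bueno!!! :D", "Check my blog http://t.co/x #news"]
pos_docs = ["INTJ ADJ ADJ PUNCT EMO", "VERB DET NOUN URL HASHTAG"]
labels = ["es", "en"]

# Character 3-grams retain stylistic cues (emoticons, flooding, capitalisation)
char_vec = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3))
# POS 2-grams abstract away from topic towards grammatical style
pos_vec = CountVectorizer(ngram_range=(2, 2))

# Concatenate both feature groups and train any linear classifier
X = sp.hstack([char_vec.fit_transform(raw_docs), pos_vec.fit_transform(pos_docs)])
clf = LinearSVC().fit(X, labels)
```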
Citations: 3
Compilação de Corpos Comparáveis Especializados: Devemos sempre confiar nas Ferramentas de Compilação Semi-automáticas? (Compiling Specialised Comparable Corpora: Should We Always Trust Semi-Automatic Compilation Tools?)
IF 0.6 Q4 LINGUISTICS Pub Date: 2016-07-22 DOI: 10.21814/LM.8.1.221
Hernani Costa, Isabel Dúran Muñoz, Gloria Corpas Pastor, Ruslan Mitkov
Decisions at the outset of compiling a comparable corpus are of crucial importance for how the corpus is built and analysed later on. Several variables and external criteria are usually followed when building a corpus, but little has been said about textual distributional similarity in this context and the quality it brings to research. In an attempt to fill this gap, this paper presents a simple but efficient methodology capable of measuring a corpus-internal degree of relatedness. To do so, the methodology takes advantage of both available natural language processing technology and statistical methods to assess the degree of relatedness between documents. Our findings show that using a list of common entities and a set of distributional similarity measures is enough not only to describe and assess the degree of relatedness between the documents in a comparable corpus, but also to rank them according to their degree of relatedness within the corpus.
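One way to realise this idea in code, as a sketch: documents are assumed to be token lists, the list of common entities is given, and cosine stands in for the several distributional similarity measures the paper compares.

```python
import numpy as np

def entity_vector(tokens, common_entities):
    # Represent a document by the counts of each shared entity
    return np.array([tokens.count(e) for e in common_entities], dtype=float)

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def relatedness_ranking(documents, common_entities):
    vecs = [entity_vector(d, common_entities) for d in documents]
    # A document's relatedness = its mean similarity to every other document
    scores = [np.mean([cosine(v, w) for j, w in enumerate(vecs) if j != i])
              for i, v in enumerate(vecs)]
    return sorted(range(len(documents)), key=scores.__getitem__, reverse=True)
```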
Citations: 1
Propuesta de clasificación de un banco de voces con fines de identificación forense (A Classification Proposal for a Voice Bank for Forensic Identification Purposes)
IF 0.6 Q4 LINGUISTICS Pub Date: 2016-07-22 DOI: 10.21814/LM.8.1.226
Fernanda López-Escobedo, Julián Solórzano-Soto
This article presents an ongoing project that proposes a classification of a voice bank for forensic identification purposes. It describes how linguistic information can be used in a database to reduce the number of false positives and false negatives that arise when automated comparisons are carried out for forensic voice identification. In particular, it discusses the phonetic phenomena that have been proposed for classifying a voice bank at this level of the language. Building on this information, it describes how to construct a database model and the kinds of queries it is expected to support. The proposal of generating linguistic descriptors for classifying a voice bank is intended as a methodology that can assist the administration of justice in Mexico and other Spanish-speaking countries.
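A hypothetical illustration of the kind of query such a database could support: filtering the voice bank by phonetic descriptors before any automatic comparison, so fewer candidate voices need to be compared. All field names here are invented for the example.

```python
voice_bank = [
    {"id": 1, "seseo": True,  "yeismo": True,  "aspirated_s": False},
    {"id": 2, "seseo": False, "yeismo": True,  "aspirated_s": True},
]

def candidates(bank, **descriptors):
    # Keep only the records that match every queried phonetic descriptor
    return [r for r in bank if all(r.get(k) == v for k, v in descriptors.items())]

print(candidates(voice_bank, seseo=True, yeismo=True))  # -> only record 1
```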
Citations: 0
ASinEs: Prolegómenos de un atlas de la variación sintáctica del español (ASinEs: Prolegomena to an Atlas of Syntactic Variation in Spanish)
IF 0.6 Q4 LINGUISTICS Pub Date: 2015-12-30 DOI: 10.21814/LM.7.2.215
A. Cerrudo, Á. J. Gallego, Anna Pineda, F. Roca
This paper introduces ASinEs, an atlas-based application devoted to the synchronic study of the syntactic variation of Spanish geolects. The project is groundbreaking, as there is no other atlas exclusively devoted to studying the geolectal variation of Spanish syntax. Although ASinEs was originally conceived to explore the current geolects of Spanish, its flexibility allows it to study both the geolects of previous stages of the language and those of other languages currently in contact with it. This provides a powerful tool for research on variation in both Romance and non-Romance languages (Basque, English, Amerindian languages, etc.). The project is being developed in collaboration with the Centre de Lingüística Teòrica (Universitat Autònoma de Barcelona), the IKER Center at Bayonne (France), and the Real Academia Española.
Citations: 0
Descoberta de Synsets Difusos com base na Redundância em vários Dicionários (Discovering Fuzzy Synsets from Redundancy across Several Dictionaries)
IF 0.6 Q4 LINGUISTICS Pub Date: 2015-12-30 DOI: 10.21814/LM.7.2.213
Fábio Santos, Hugo Gonçalo Oliveira
In a wordnet, concepts are typically represented as groups of words, commonly known as synsets, and each membership of a word in a synset denotes a different sense of that word. However, since word senses are complex entities without well-defined boundaries, we suggest handling them less artificially by representing synsets as fuzzy objects, where each word has a membership degree that can be related to the confidence in using the word to denote the concept conveyed by the synset. We thus propose an approach to discover synsets from a synonymy network, ideally redundant because it is extracted from several broad-coverage sources. The more synonymy relations there are between two words, the higher the confidence in the semantic equivalence of at least one of their senses. The proposed approach was applied to a network extracted from three Portuguese dictionaries and resulted in a large set of fuzzy synsets. Besides describing this approach and illustrating its results, we rely on three evaluations: comparison against a handcrafted Portuguese thesaurus; comparison against the results of a previous approach with a similar goal; and manual evaluation. These lead us to conclude that our outcomes are positive and that, in the future, they might be expanded by exploring additional synonymy sources.
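A minimal sketch of the redundancy idea (not the authors' full algorithm): weight each synonymy edge by how many dictionaries attest it, then turn those weights into fuzzy membership degrees for the words of a candidate synset.

```python
from collections import Counter

def edge_weights(dictionaries):
    # dictionaries: list of {word: set of synonyms} resources
    weights = Counter()
    for d in dictionaries:
        for word, synonyms in d.items():
            for s in synonyms:
                weights[frozenset((word, s))] += 1  # +1 per attestation
    return weights

def fuzzy_membership(synset, weights):
    # A word's raw score = attested links to the rest of the synset;
    # normalising by the best-connected word yields degrees in (0, 1]
    raw = {w: sum(weights[frozenset((w, v))] for v in synset if v != w)
           for w in synset}
    top = max(raw.values()) or 1
    return {w: score / top for w, score in raw.items()}
```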
Citations: 3
Uso de uma Ferramenta de Processamento de Linguagem Natural como Auxílio à Coleta de Exemplos para o Estudo de Propriedades Sintático-Semânticas de Verbos (Using a Natural Language Processing Tool to Aid Example Collection for the Study of Syntactic-Semantic Properties of Verbs)
IF 0.6 Q4 LINGUISTICS Pub Date: 2015-12-30 DOI: 10.21814/LM.7.2.216
Larissa Picoli, Juliana Pinheiro Campos Pirovani, E. Oliveira, Eric Guy Claude Laporte
The analysis and description of syntactic-semantic properties of verbs are important for understanding how a language works and fundamental for automatic natural language processing, since an encoding of such descriptions can be exploited by tools that perform this kind of processing. This work experiments with Unitex, a natural language processing tool, to collect a list of verbs that can then be analysed and described by a linguist. This contributes significantly to this kind of linguistic study by reducing the human manual effort of searching for verbs. A case study was carried out to partially automate the collection of adjective-based verbs with the suffix -ecer in a corpus of 47 million words. The proposed approach is compared with manual collection and with extraction from an NLP dictionary.
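A simplified stand-in for that collection step (a bare regular expression rather than Unitex graphs, which also exploit dictionaries and inflection and so catch far more than this sketch does):

```python
import re
from collections import Counter

PATTERN = re.compile(r"\b\w+ecer\b", re.IGNORECASE)

def ecer_candidates(text):
    # Frequency counts let the linguist triage the most common verbs first
    return Counter(m.group(0).lower() for m in PATTERN.finditer(text))

print(ecer_candidates("A tinta vai escurecer e depois amarelecer."))
# Counter({'escurecer': 1, 'amarelecer': 1})
```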
Citations: 1
El Test de Turing para la evaluación de resumen automático de texto (The Turing Test for the Evaluation of Automatic Text Summarization)
IF 0.6 Q4 LINGUISTICS Pub Date: 2015-12-30 DOI: 10.21814/LM.7.2.214
Alejandro Molina-Villegas, Juan-Manuel Torres-Moreno
Currently there are several methods to produce text summaries automatically, but their evaluation remains a challenging issue. In this paper, we study the quality assessment of summaries generated automatically by a sentence-compression method. We deal with one of the major drawbacks of automatic metrics, which take into account neither the grammar nor the validity of sentences. Our proposal is based on the Turing test, in which human judges must identify the origin, human or automatic, of a series of summaries. We also explain how to statistically validate the judges' answers using Fisher's exact test.
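The validation step can be sketched as follows, with made-up counts: arrange the judges' verdicts in a 2x2 contingency table and apply Fisher's exact test.

```python
from scipy.stats import fisher_exact

#                judged "human"  judged "automatic"
table = [[18, 7],   # summaries actually written by humans
         [9, 16]]   # summaries produced automatically

odds_ratio, p_value = fisher_exact(table)
# A small p-value is evidence that judges can tell the two origins apart;
# a large one gives no evidence that the automatic summaries were
# reliably distinguished from the human ones.
print(odds_ratio, p_value)
```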
Citations: 1
Reconocimiento de términos en español mediante la aplicación de un enfoque de comparación entre corpus (Spanish Term Recognition through a Corpus-Comparison Approach)
IF 0.6 Q4 LINGUISTICS Pub Date: 2015-12-01 DOI: 10.21814/LM.7.2.217
O. A. López, C. Aguilar, Tomás Infante
In this article we present a methodology for identifying and extracting terms from Spanish text sources belonging to specialised knowledge domains by means of a contrastive, corpus-comparison approach. The contrastive approach requires a measure for assigning relevance to words that occur both in the domain corpus and in a reference corpus. We therefore explored four such measures, with the goal of incorporating the best-performing one into our methodology. Our results show a better performance of the rank-difference and relative-frequency-ratio measures compared with the log-likelihood ratio and the measure used by Termostat.
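Two of the compared measures are simple enough to sketch directly; one common formulation of each is shown below, and the smoothing and normalisation choices are assumptions rather than the paper's exact definitions.

```python
def relative_frequency_ratio(word, domain_freq, ref_freq):
    # Normalised domain frequency over normalised reference frequency;
    # add-one smoothing handles words unseen in the reference corpus
    d = domain_freq[word] / sum(domain_freq.values())
    r = (ref_freq.get(word, 0) + 1) / (sum(ref_freq.values()) + 1)
    return d / r

def rank_difference(word, domain_freq, ref_freq):
    def norm_rank(freqs, w):
        ordered = sorted(freqs, key=freqs.get, reverse=True)
        rank = ordered.index(w) + 1 if w in freqs else len(ordered) + 1
        return rank / len(ordered)
    # Positive values favour words ranked much higher in the domain corpus
    return norm_rank(ref_freq, word) - norm_rank(domain_freq, word)
```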
Citations: 3
Estudio de la influencia de incorporar conocimiento léxico-semántico a la técnica de Análisis de Componentes Principales para la generación de resúmenes multilingües (A Study of the Influence of Incorporating Lexical-Semantic Knowledge into Principal Component Analysis for Multilingual Summary Generation)
IF 0.6 Q4 LINGUISTICS Pub Date: 2015-07-31 DOI: 10.21814/LM.7.1.205
Óscar Alcón, E. Lloret
The objective of automatic text summarization is to reduce the dimension of a text while keeping the relevant information. In this paper we analyse and apply the language-independent Principal Component Analysis (PCA) technique for generating extractive single-document multilingual summaries. We evaluate its performance with and without the addition of lexical-semantic knowledge obtained through language-dependent resources and tools. Experiments were conducted on two different corpora, newswire and Wikipedia articles in three languages (English, German and Spanish), to validate the use of the technique in several scenarios. The proposed approaches show very competitive results compared to available multilingual systems, indicating that, although there is still room for improvement in the technique and in the type of knowledge taken into consideration, it has great potential for application in other contexts and languages.
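A minimal sketch of the language-independent baseline (PCA over a plain term matrix; the lexical-semantic variants would enrich or replace this matrix and are not shown):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer

def pca_summary(sentences, n_sentences=3, n_components=1):
    # Bag-of-words sentence matrix; no language-specific processing needed
    X = CountVectorizer().fit_transform(sentences).toarray()
    # Score each sentence by the magnitude of its principal-component projection
    scores = np.abs(PCA(n_components=n_components).fit_transform(X)).sum(axis=1)
    top = sorted(np.argsort(scores)[::-1][:n_sentences])  # keep document order
    return [sentences[int(i)] for i in top]
```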
Citations: 3