In this paper we present a proposal to automatically label the similarity between a pair of sentences, together with the results obtained in the ASSIN 2016 sentence similarity shared task. Our proposal combines a classical bag-of-words feature, the TF-IDF model, with an emergent feature obtained from word embeddings. TF-IDF is used to relate texts that share words, while word embeddings are known to capture the syntax and semantics of words. Following Mikolov et al. (2013), the sum of embedding vectors can model the meaning of a sentence. Using both features, we are able to capture both the words shared between sentences and their semantics. We use linear regression to solve the problem, since the dataset is labeled with real numbers between 1 and 5. Our results are promising. Although the embedding feature alone did not outperform our baseline system, combining it with TF-IDF achieved better results than using TF-IDF alone. Our system ranked first in the ASSIN 2016 sentence similarity shared task for Brazilian Portuguese sentences and second for European Portuguese sentences.
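As a rough illustration of the kind of pipeline described, the sketch below pairs a TF-IDF cosine feature with a summed-embedding cosine feature and fits a linear regression over 1-5 similarity scores. The sentence pairs, scores and random vectors are toy stand-ins, not the ASSIN data or the authors' pretrained embeddings, and the function names are ours.

```python
# Minimal sketch: TF-IDF similarity and summed word embeddings as features
# for a linear regression over 1-5 similarity scores (assumed toy setup).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics.pairwise import cosine_similarity

def sentence_vector(sentence, embeddings, dim=50):
    """Sum the embedding vectors of the words in the sentence (Mikolov-style)."""
    words = sentence.lower().split()
    return sum((embeddings.get(w, np.zeros(dim)) for w in words), np.zeros(dim))

def pair_features(s1, s2, tfidf, embeddings, dim=50):
    """Two features per pair: TF-IDF cosine and embedding-sum cosine."""
    t1, t2 = tfidf.transform([s1]), tfidf.transform([s2])
    e1 = sentence_vector(s1, embeddings, dim).reshape(1, -1)
    e2 = sentence_vector(s2, embeddings, dim).reshape(1, -1)
    return [cosine_similarity(t1, t2)[0, 0], cosine_similarity(e1, e2)[0, 0]]

# Toy data in place of the ASSIN training pairs (gold scores in [1, 5]).
pairs = [("o gato dorme no sofá", "um gato está dormindo no sofá", 4.5),
         ("ele comprou um carro novo", "a reunião começa às nove", 1.0)]
tfidf = TfidfVectorizer().fit([s for p in pairs for s in p[:2]])
rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=50)            # stand-in for pretrained vectors
              for p in pairs for s in p[:2] for w in s.lower().split()}

X = [pair_features(s1, s2, tfidf, embeddings) for s1, s2, _ in pairs]
y = [score for _, _, score in pairs]
model = LinearRegression().fit(X, y)
print(model.predict(X))
```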
{"title":"Solo Queue at ASSIN: Combinando Abordagens Tradicionais e Emergentes","authors":"N. Hartmann","doi":"10.21814/LM.8.2.230","DOIUrl":"https://doi.org/10.21814/LM.8.2.230","url":null,"abstract":"In this paper we present a proposal to automatically label the similarity between a pair of sentences and the results obtained on ASSIN 2016 sentence similarity shared-task. Our proposal consists of using a classical feature of bag-of-words, the TF-IDF model; and an emergent feature, obtained from processing word embeddings. The TF-IDF is used to relate texts which share words. Word embeddings are known by capture the syntax and semantics of a word. Following Mikolov et al. (2013), the sum of embedding vectors can model the meaning of a sentence. Using both features, we are able to capture the words shared between sentences and their semantics. We use linear regression to solve this problem, once the dataset is labeled as real numbers between 1 and 5. Our results are promising. Although the usage of embeddings has not overcome our baseline system, when we combined it with TF-IDF, our system achieved better results than only using TF-IDF. Our results achieved the first collocation of ASSIN 2016 for sentence similarity shared-task applied on brazilian portuguese sentences and second collocation when applying to Portugal portuguese sentences.","PeriodicalId":41819,"journal":{"name":"Linguamatica","volume":"29 1","pages":"59-64"},"PeriodicalIF":0.6,"publicationDate":"2016-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68372457","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Carlos-Emiliano González-Gallardo, Juan-Manuel Torres-Moreno, Azucena Montes Rendón, Gerardo E Sierra
In this article we present an algorithm that combines the stylistic features captured by character n-grams and part-of-speech (POS) n-grams to classify multilingual social-media documents. A dynamic, context-dependent normalization was applied to both groups of n-grams in order to extract as much as possible of the stylistic information encoded in the documents (emoticons, character flooding, use of capital letters, references to users, links to external sites, hashtags, etc.). The algorithm was applied to two different corpora: the tweets from the training corpus of the PAN-CLEF 2015 Author Profiling task and the corpus "Comentarios de la Ciudad de Mexico en el tiempo" (CCDMX). The results show very high accuracy, close to 90%.
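A minimal sketch, under simplifying assumptions, of the feature combination described above: character n-grams over normalized text plus n-grams over POS-tag sequences, concatenated and fed to a linear classifier. The normalization rules, the pre-tagged POS sequences and the class labels below are illustrative stand-ins, not the authors' exact PAN-CLEF setup.

```python
# Character n-grams + POS n-grams for author profiling (assumed toy setup).
import re
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

def normalize(tweet):
    """Context-dependent normalization: collapse user mentions, links, hashtags
    and character flooding into placeholder tokens; keep emoticons as-is."""
    tweet = re.sub(r"@\w+", "@user", tweet)
    tweet = re.sub(r"https?://\S+", "http://link", tweet)
    tweet = re.sub(r"#\w+", "#tag", tweet)
    return re.sub(r"(.)\1{2,}", r"\1\1\1", tweet)   # "holaaaa" -> "holaaa"

texts = ["@maria holaaaa!! mira esto http://t.co/xyz :)",
         "El informe anual está disponible en el portal institucional."]
pos_seqs = ["PROPN INTJ PUNCT VERB PRON URL EMO",      # assumed, pre-tagged
            "DET NOUN ADJ VERB ADJ ADP DET NOUN ADJ PUNCT"]
labels = ["18-24", "35-49"]                            # e.g. age classes

char_vec = CountVectorizer(analyzer="char_wb", ngram_range=(2, 4))
pos_vec = CountVectorizer(analyzer="word", ngram_range=(1, 3), lowercase=False)

X = hstack([char_vec.fit_transform([normalize(t) for t in texts]),
            pos_vec.fit_transform(pos_seqs)])
clf = LinearSVC().fit(X, labels)
print(clf.predict(X))
```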
{"title":"Perfilado de autor multilingüe en redes sociales a partir de n-gramas de caracteres y de etiquetas gramaticales","authors":"Carlos-Emiliano González-Gallardo, Juan-Manuel Torres-Moreno, Azucena Montes Rendón, Gerardo E Sierra","doi":"10.21814/LM.8.1.227","DOIUrl":"https://doi.org/10.21814/LM.8.1.227","url":null,"abstract":"En este articulo presentamos un algoritmo que combina las caracteristicas estilisticas representadas por los n-gramas de caracteres y los n-gramas de etiquetas gramaticales (POS) para clasificar documentos multilengua de redes sociales. En ambos grupos de n-gramas se aplico una normalizacion dinamica dependiente del contexto para extraer la mayor cantidad de informacion estilistica posible codificada en los documentos (emoticonos, inundamiento de caracteres, uso de letras mayusculas, referencias a usuarios, ligas a sitios externos, hashtags, etc.). El algoritmo fue aplicado sobre dos corpus diferentes: los tweets del corpus de entrenamiento de la tarea Author Profiling de PAN-CLEF 2015 y el corpus de \"Comentarios de la Ciudad de Mexico en el tiempo\" (CCDMX). Los resultados presentan una exactitud muy alta, cercana al 90%.","PeriodicalId":41819,"journal":{"name":"Linguamatica","volume":"8 1","pages":"21-29"},"PeriodicalIF":0.6,"publicationDate":"2016-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68372105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hernani Costa, Isabel Dúran Muñoz, Gloria Corpas Pastor, Ruslan Mitkov
Decisions made at the outset of compiling a comparable corpus are of crucial importance for how the corpus is built and analysed later on. Several variables and external criteria are usually followed when building a corpus, but little has been said about textual distributional similarity in this context and the quality it brings to research. In an attempt to fill this gap, this paper presents a simple but efficient methodology for measuring a corpus's internal degree of relatedness. To do so, the methodology takes advantage of available natural language processing technology and statistical methods to assess the degree of relatedness between documents. Our findings show that using a list of common entities and a set of distributional similarity measures is enough not only to describe and assess the degree of relatedness between the documents in a comparable corpus, but also to rank the documents according to their degree of relatedness within the corpus.
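The core idea can be illustrated roughly as follows: represent each document by its terms, compute pairwise similarities, and rank each document by its mean similarity to the rest of the corpus. This is only a sketch; the paper's methodology relies on its own entity lists and several distributional similarity measures rather than the single TF-IDF cosine used here, and the documents are toy data.

```python
# Rank documents of a (toy) comparable corpus by mean similarity to the others.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = {
    "doc1": "wind turbine energy rotor blade generator",
    "doc2": "turbine generator energy grid wind power",
    "doc3": "medieval poetry manuscript troubadour verse",
}
names = list(docs)
X = TfidfVectorizer().fit_transform(docs.values())
sims = cosine_similarity(X)
np.fill_diagonal(sims, 0.0)
relatedness = sims.sum(axis=1) / (len(names) - 1)   # mean similarity to the rest

for name, score in sorted(zip(names, relatedness), key=lambda p: -p[1]):
    print(f"{name}: {score:.3f}")
```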
{"title":"Compilação de Corpos Comparáveis Especializados: Devemos sempre confiar nas Ferramentas de Compilação Semi-automáticas?","authors":"Hernani Costa, Isabel Dúran Muñoz, Gloria Corpas Pastor, Ruslan Mitkov","doi":"10.21814/LM.8.1.221","DOIUrl":"https://doi.org/10.21814/LM.8.1.221","url":null,"abstract":"Decisions at the outset of compiling a comparable corpus are of crucial importance for how the corpus is to be built and analysed later on. Several variables and external criteria are usually followed when building a corpus but little is been said about textual distributional similarity in this context and the quality that it brings to research. In an attempt to fulfil this gap, this paper aims at presenting a simple but efficient methodology capable of measuring a corpus internal degree of relatedness. To do so, this methodology takes advantage of both available natural language processing technology and statistical methods in a successful attempt to access the relatedness degree between documents. Our findings prove that using a list of common entities and a set of distributional similarity measures is enough not only to describe and assess the degree of relatedness between the documents in a comparable corpus, but also to rank them according to their degree of relatedness within the corpus.","PeriodicalId":41819,"journal":{"name":"Linguamatica","volume":"8 1","pages":"3-19"},"PeriodicalIF":0.6,"publicationDate":"2016-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68371888","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This article presents a project under development that proposes a classification of a voice bank for forensic identification purposes. It describes how linguistic information can be used in a database to reduce the number of false positives and false negatives that arise when automated comparisons are carried out for forensic voice identification. In particular, it discusses the phonetic phenomena that have been proposed for classifying a voice bank at this level of the language. Building on this information, it describes how to construct a database model and the kinds of queries it is expected to support. The proposal of generating linguistic descriptors for the classification of a voice bank is intended as a methodology that can assist in the administration of justice in Mexico and other Spanish-speaking countries.
{"title":"Propuesta de clasificación de un banco de voces con fines de identificación forense","authors":"Fernanda López-Escobedo, Julián Solórzano-Soto","doi":"10.21814/LM.8.1.226","DOIUrl":"https://doi.org/10.21814/LM.8.1.226","url":null,"abstract":"En este articulo se presenta el proyecto que se desarrolla para proponer una clasificacion de un banco de voces con fines de identificacion forense. Se expone la manera en que la informacion linguistica puede ser utilizada en una base de datos para reducir el numero de falsos positivos y falsos negativos que resultan cuando se llevan a cabo comparaciones automatizadas para la identificacion forense de voz. En particular, se abordan los fenomenos foneticos que se han propuesto para realizar una clasificacion de un banco de voces en este nivel de la lengua. A partir de esta informacion se describe como construir un modelo de base de datos y el tipo de busquedas que se espera lograr. La propuesta de generar descriptores linguisticos para la clasificacion de un banco de voces pretende ser una metodologia que permita coadyuvar en la imparticion de justicia en Mexico y otros paises de habla hispana.","PeriodicalId":41819,"journal":{"name":"Linguamatica","volume":"8 1","pages":"33-41"},"PeriodicalIF":0.6,"publicationDate":"2016-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68371941","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper introduces ASinEs, an atlas-based application devoted to the synchronic study of the syntactic variation of Spanish geolects. The project is groundbreaking, as there is no other atlas exclusively devoted to studying the geolectal variation of Spanish syntax. Although ASinEs was originally conceived to explore the current geolects of Spanish, its flexibility also allows the study of geolects from earlier stages of the language, as well as of other languages currently in contact with it. This provides a powerful tool for research on variation in both Romance and non-Romance languages (Basque, English, Amerindian languages, etc.). The project is being developed in collaboration with the Centre de Lingüística Teòrica (Universitat Autònoma de Barcelona), the IKER Center at Bayonne (France), and the Real Academia Española.
{"title":"ASinEs: Prolegómenos de un atlas de la variación sintáctica del español","authors":"A. Cerrudo, Á. J. Gallego, Anna Pineda, F. Roca","doi":"10.21814/LM.7.2.215","DOIUrl":"https://doi.org/10.21814/LM.7.2.215","url":null,"abstract":"espanolEn este articulo se presenta el ASinEs, una aplicacion con formato de atlas dedicada al estudio sincronico de la variacion sintactica de los geolectos del espanol. Este proyecto es innovador, ya que no existe ningun atlas dedicado exclusivamente a investigar la variacion geolectal de la sintaxis de esta lengua. La versatilidad del ASinEs permite tambien el estudio de geolectos de otros estadios del espanol, asi como los de otras lenguas con las que esta actualmente en contacto. Todo ello proporciona una potente herramienta para la investigacion en el campo de la variacion de las lenguas romanicas y no romanicas (vasco, ingles, lenguas amerindias, etc.).El desarrollo de este proyecto cuenta con la colaboracion del Centre de Linguistica Teorica (Universitat Autonoma de Barcelona), el Centro IKER con sede en Bayona (Francia) y la Real Academia Espanola. EnglishThis paper introduces the ASinEs1, an atlas-based application devoted to the study of the syntactic variation of Spanish geolects. This project is groundbreaking, as there is no other atlas exclusively devoted to study the geolectal variation of geolectal variants of Spanish. Although ASinEs was originally conceived to explore the current geolects of Spanish, its flexibility allows it to study both the geolects of previous stages and the geolects of other close-by languages. This provides us with a po-werful tool to study variation of both Romance and non-Romance languages (Basque, English, Amerindi-an languages, etc.). This project is being developed in collaboration with the Centre de Ling¨u´istica Te`orica (Universitat Aut`onoma de Barcelona), the IKER Center at Bayonne (France), and the Real Academia Espanola.","PeriodicalId":41819,"journal":{"name":"Linguamatica","volume":"7 1","pages":"59-69"},"PeriodicalIF":0.6,"publicationDate":"2015-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68371716","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In a wordnet, concepts are typically represented as groups of words, commonly known as synsets, and each membership of a word in a synset denotes a different sense of that word. However, since word senses are complex entities without well-defined boundaries, we suggest handling them less artificially by representing them as fuzzy objects, where each word has a membership degree that can be related to the confidence in using that word to denote the concept conveyed by the synset. We thus propose an approach to discover synsets from a synonymy network, ideally a redundant one extracted from several broad-coverage sources. The more synonymy relations there are between two words, the higher the confidence in the semantic equivalence of at least one of their senses. The proposed approach was applied to a network extracted from three Portuguese dictionaries and resulted in a large set of fuzzy synsets. Besides describing the approach and illustrating its results, we rely on three evaluations, namely comparison against a handcrafted Portuguese thesaurus, comparison against the results of a previous approach with a similar goal, and manual evaluation, to conclude that our outcomes are positive and that, in the future, they might be expanded by exploring additional sources of synonyms.
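A toy sketch of the redundancy idea: an edge between two words is weighted by how many dictionaries list them as synonyms, and a word's membership degree in a synset is its normalized attested-link weight into that group. The real approach discovers the groups from the synonymy network itself; the hard-coded grouping, the made-up dictionary entries and the normalization below are assumptions made only to keep the example short.

```python
# Fuzzy membership from redundancy across (hypothetical) synonym dictionaries.
from collections import defaultdict

dictionaries = [                       # synonym lists from three hypothetical sources
    {"carro": {"automóvel", "viatura"}, "bonito": {"belo"}},
    {"carro": {"automóvel"}, "bonito": {"belo", "formoso"}},
    {"carro": {"automóvel", "carruagem"}, "bonito": {"belo"}},
]

# Count in how many sources each synonymy pair is attested.
pair_count = defaultdict(int)
for source in dictionaries:
    for word, syns in source.items():
        for s in syns:
            pair_count[frozenset((word, s))] += 1

def memberships(group):
    """Membership degree of each word: attested links into the group, normalized."""
    weight = {w: sum(pair_count.get(frozenset((w, o)), 0) for o in group if o != w)
              for w in group}
    top = max(weight.values()) or 1
    return {w: weight[w] / top for w in group}

print(memberships({"carro", "automóvel", "viatura", "carruagem"}))
print(memberships({"bonito", "belo", "formoso"}))
```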
{"title":"Descoberta de Synsets Difusos com base na Redundância em vários Dicionários","authors":"Fábio Santos, Hugo Gonçalo Oliveira","doi":"10.21814/LM.7.2.213","DOIUrl":"https://doi.org/10.21814/LM.7.2.213","url":null,"abstract":"EnglishIn a wordnet, concepts are typically represented as groups of words, commonly known as synsets, and each membership of a word to a synset denotes a different sense of that word. However, since word senses are complex entities, without well-defined boundaries, we suggest to handle them less artificially, by representing them as fuzzy objects, where each word has its membership degree, which can be related to the confidence on using the word to denote the concept conveyed by the synset. We thus propose an approach to discover synsets from a synonymy network, ideally redundant and extracted from several broad-coverage sources. The more synonymy relations there are between two words, the higher the confidence on the semantic equivalence of at least one of their senses. The proposed approach was applied to a network extracted from three Portuguese dictionaries and resulted in a large set of fuzzy synsets. Besides describing this approach and illustrating its results, we rely on three evaluations — comparison against a handcrafted Portuguese thesaurus; comparison against the results of a previous approach with a similar goal; and manual evaluation — to believe that our outcomes are positive and that, in the future, they might my expanded by exploring additional synonymy sources portuguesNuma wordnet, conceitos sao representados atraves de grupos de palavras, vulgarmente chamados de synsets, e cada pertenca de uma palavra a um synset representa um diferente sentido dessa mesma palavra. Mas como os sentidos sao entidades complexas, sem fronteiras bem definidas, para lidar com eles de forma menos artificial, sugerimos que synsets sejam tratados como conjuntos difusos, em que cada palavra tem um grau de pertenca, associado a confianca que existe na utilizacao de cada palavra para transmitir o conceito que emerge do synset. Propomos entao uma abordagem automatica para descobrir um conjunto de synsets difusos a partir de uma rede de sinonimos, idealmente redundante, por ser extraida a partir de varias fontes, e o mais abrangentes possivel. Um dos principios e que, em quantos mais recursos duas palavras forem consideradas sinonimos, maior confianca havera na equivalencia de pelo menos um dos seus sentidos. A abordagem proposta foi aplicada a uma rede extraida a partir de tres dicionarios do portugues e resultou num novo conjunto de synsets para esta lingua, em que as palavras tem pertencas difusas, ou seja, fuzzy synsets. 
Para alem de apresentar a abordagem e a ilustrar com alguns resultados obtidos, baseamo-nos em tres avaliacoes — comparacao com um tesauro criado manualmente para o portugues; comparacao com uma abordagem anterior com o mesmo objetivo; e avaliacao manual — para confirmar que os resultados sao positivos, e poderao no futuro ser expandidos atraves da exploracao de outras fontes de sinonimos.","PeriodicalId":41819,"journal":{"name":"Linguamatica","volume":"7 1","pages":"3-17"},"PeriodicalIF":0.6,"publicationDate":"2015-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68372028","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Larissa Picoli, Juliana Pinheiro Campos Pirovani, E. Oliveira, Eric Guy Claude Laporte
The analysis and description of the syntactic-semantic properties of verbs are important for understanding how a language works and are fundamental for natural language processing, since the encoding of such descriptions can be exploited by tools that perform this kind of processing. This work experiments with the use of Unitex, a natural language processing tool, to collect a list of verbs that can then be analysed and described by a linguist. This contributes significantly to this kind of linguistic study by reducing the human manual effort involved in searching for verbs. A case study was carried out to partially automate the collection of adjective-based verbs with the suffix -ecer in a corpus of 47 million words. The proposed approach is compared with manual collection and with extraction from an NLP dictionary.
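The candidate-collection step can be approximated, very roughly, by scanning a corpus for -ecer forms and counting them for later manual review by a linguist. Unitex itself works with electronic dictionaries and graphs; the regex pass below is only a simplified stand-in, on an invented two-sentence corpus, to illustrate the kind of candidate list being gathered.

```python
# Collect and count candidate "-ecer" infinitives from a (toy) corpus.
import re
from collections import Counter

corpus = ("As flores começaram a amarelecer com o frio. "
          "O céu pode escurecer antes da chuva, e o debate tende a enriquecer o texto.")

# Word forms ending in -ecer (candidate adjective-based verbs: amarelecer, escurecer...)
candidates = Counter(m.group(0).lower()
                     for m in re.finditer(r"\b\w+ecer\b", corpus))

for verb, freq in candidates.most_common():
    print(verb, freq)
```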
{"title":"Uso de uma Ferramenta de Processamento de Linguagem Natural como Auxílio à Coleta de Exemplos para o Estudo de Propriedades Sintático-Semânticas de Verbos","authors":"Larissa Picoli, Juliana Pinheiro Campos Pirovani, E. Oliveira, Eric Guy Claude Laporte","doi":"10.21814/LM.7.2.216","DOIUrl":"https://doi.org/10.21814/LM.7.2.216","url":null,"abstract":"A analise e descricao de propriedades sintatico-semânticas de verbos sao importantes para a compreensao do funcionamento de uma lingua e fundamentais para o processamento automatico de linguagem natural, uma vez que a codificacao dessa descricao pode ser explorada por ferramentas que realizam esse tipo de processamento. Esse trabalho experimenta o uso do Unitex, uma ferramenta de processamento de linguagem natural, para coletar uma lista de verbos que podem ser analisados e descritos por um linguista. Isso contribui significativamente para esse tipo de estudo linguistico, diminuindo o esforco manual humano na busca de verbos. Foi realizado um estudo de caso para automatizar parcialmente a coleta de verbos de base adjetiva com sufixo -ecer em um corpus de 47 milhoes de palavras. A abordagem proposta e comparada com a coleta manual e a extracao a partir de um dicionario para o PLN.","PeriodicalId":41819,"journal":{"name":"Linguamatica","volume":"6 1","pages":"35-44"},"PeriodicalIF":0.6,"publicationDate":"2015-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68371776","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Currently there are several methods for producing text summaries automatically, but their evaluation remains a challenging issue. In this paper we study the quality assessment of summaries produced automatically by a sentence-compression method. We address one of the major drawbacks of automatic metrics, which take into account neither the grammaticality nor the validity of sentences. Our evaluation proposal is based on the Turing test, in which several human judges must identify the origin, human or automatic, of a series of summaries. We also explain how to statistically validate the judges' responses using Fisher's exact test.
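A minimal sketch of the statistical validation step: a judge's human/automatic decisions are tallied in a 2x2 contingency table against the true origin of each summary, and Fisher's exact test checks whether the judge distinguishes the two sources better than chance. The counts below are invented for illustration.

```python
# Fisher's exact test on one judge's decisions versus the true summary origin.
from scipy.stats import fisher_exact

#                      judged "human"   judged "automatic"
contingency = [[14, 6],     # summaries actually written by humans
               [5, 15]]     # summaries actually produced by the system

odds_ratio, p_value = fisher_exact(contingency, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.4f}")
# A small p suggests the judge tells the sources apart, i.e. the automatic
# summaries are still distinguishable from human-written ones.
```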
{"title":"El Test de Turing para la evaluación de resumen automático de texto","authors":"Alejandro Molina-Villegas, Juan-Manuel Torres-Moreno","doi":"10.21814/LM.7.2.214","DOIUrl":"https://doi.org/10.21814/LM.7.2.214","url":null,"abstract":"espanolActualmente existen varios metodos para producir resumenes de texto de manera automatica, pero la evaluacion de los mismos continua siendo un tema desafiante. En este articulo estudiamos la evaluacion de la calidad de resumenes producidos de manera automatica mediante un metodo de compresion de frases. Abordamos la problematica que supone el uso de metricas automaticas, las cuales no toman en cuenta ni la gramatica ni la validez de las oraciones. Nuestra propuesta de evaluacion esta basada en el test de Turing, en el cual varios jueces humanos deben identificar el origen, humano o automatico, de una serie de resumenes. Tambien explicamos como validar las respuestas de los jueces por medio del test estadistico de Fisher. EnglishCurrently there are several methods to produce summaries of text automatically, but the evaluation of these remains a challenging issue. In this paper, we study the quality assessment of automatically generated abstracts. We deal with one of the major drawbacks of automatic metrics, which do not take into account either the grammar or the validity of sentences. Our proposal is based on the Turing test, in which a human judges must identify the source of a series of summaries. We propose how statistically validate the judgements using the Fisher's exact test.","PeriodicalId":41819,"journal":{"name":"Linguamatica","volume":"7 1","pages":"45-55"},"PeriodicalIF":0.6,"publicationDate":"2015-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68372077","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this article we present a methodology for identifying and extracting terms from Spanish text sources belonging to specialized knowledge domains by means of a contrastive corpus-comparison approach. The contrastive approach uses measures that assign relevance to words occurring both in the domain corpus and in a reference corpus (a general-language corpus or one from a different domain). In this work we explore four such measures with the goal of incorporating the best-performing one into our methodology. Our results show a better performance of the rank difference and relative frequency ratio measures compared with the log-likelihood ratio and the measure used by Termostat.
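The contrastive idea can be sketched as follows: score candidate terms by comparing a domain corpus against a reference corpus, here with two of the measures discussed, relative frequency ratio and a normalized rank difference. The exact formulations and smoothing in the paper may differ, and the two one-sentence corpora are toy data.

```python
# Contrastive term scoring: relative frequency ratio and rank difference (toy data).
from collections import Counter

domain = "el transformador eleva la tensión y el transformador reduce la corriente".split()
reference = "el día fue largo y la tarde fue tranquila en la ciudad".split()

dom, ref = Counter(domain), Counter(reference)
n_dom, n_ref = sum(dom.values()), sum(ref.values())

def ranks(counter):
    """Map each word to its (1-based) frequency rank within its corpus."""
    ordered = sorted(counter, key=counter.get, reverse=True)
    return {w: i + 1 for i, w in enumerate(ordered)}

r_dom, r_ref = ranks(dom), ranks(ref)

def rel_freq_ratio(word):
    """Relative frequency in the domain corpus over that in the reference corpus."""
    return (dom[word] / n_dom) / ((ref.get(word, 0) + 1) / (n_ref + 1))  # +1 smoothing

def rank_difference(word):
    """Normalized rank in the reference corpus minus normalized rank in the domain."""
    ref_rank = r_ref.get(word, len(r_ref) + 1)
    return ref_rank / len(r_ref) - r_dom[word] / len(r_dom)

for w in sorted(dom, key=rel_freq_ratio, reverse=True)[:5]:
    print(f"{w:15s} RFR={rel_freq_ratio(w):5.2f}  rank-diff={rank_difference(w):5.2f}")
```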
{"title":"Reconocimiento de términos en español mediante la aplicación de un enfoque de comparación entre corpus","authors":"O. A. López, C. Aguilar, Tomás Infante","doi":"10.21814/LM.7.2.217","DOIUrl":"https://doi.org/10.21814/LM.7.2.217","url":null,"abstract":"espanolEn este articulo presentamos una metodologia para la identificacion y extraccion de terminos a partir de fuentes textuales en espanol correspondientes a dominios de conocimiento especializados mediante un enfoque de contraste entre corpus. El enfoque de contraste entre corpus hace uso de medidas para asignar relevancia a palabras que ocurren tanto en el corpus de dominio como en corpus de lengua general o de otro dominio diferente al de interes. Dado lo anterior, en este trabajo realizamos una exploracion de cuatro medidas usadas para asignar relevancia a palabras con el objetivo de incorporar la de mejor desempeno a nuestra metodologia. Los resultados obtenidos muestran un desempeno mejor de las medidas diferencia de rangos y razon de frecuencias relativas comparado con la razon log-likelihood y la medida usada en Termostat. EnglishIn this article we present a methodology for identifying and extracting terms from text sources in Spanish corresponding specialized-domain corpus by means of a contrastive approach. The contrastive approach requires a measure for assigning relevance to words occurring both in domain corpus and reference corpus. Therefore, in this work we explored four measures used for assigning relevance to words with the goal of incorporating the best measure in our methodology. Our results show a better performance of rank difference and relative frequency ratio measures compared with log-likelihood ratio and the measure used by Termostat.","PeriodicalId":41819,"journal":{"name":"Linguamatica","volume":"38 1","pages":"19-34"},"PeriodicalIF":0.6,"publicationDate":"2015-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68371854","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The objective of automatic text summarization is to reduce the length of a text while keeping its relevant information. In this paper we analyse and apply the language-independent Principal Component Analysis technique to generate extractive single-document multilingual summaries. The technique is studied to evaluate its performance with and without adding lexical-semantic knowledge through language-dependent resources and tools. Experiments were conducted on two different corpora, newswire and Wikipedia articles, in three languages (English, German and Spanish) to validate the use of the technique in several scenarios. The proposed approaches show very competitive results compared to available multilingual systems, indicating that, although there is still room for improvement with respect to the technique and the type of knowledge taken into consideration, it has great potential to be applied in other contexts and to other languages.
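One common way to use PCA for extraction, sketched below under simplifying assumptions: represent sentences as TF-IDF vectors, project them onto the leading principal components, and select the sentences with the largest projections. The exact sentence scoring, and the knowledge-enriched variants studied in the paper, may differ from this; the sentences are toy data.

```python
# PCA-based extractive summarization sketch: pick sentences with the largest
# projections onto the top principal components of the sentence-term matrix.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "The spacecraft entered orbit around the planet on Tuesday.",
    "Mission engineers confirmed that all instruments are operating normally.",
    "The weather at the launch site had delayed earlier attempts.",
    "Orbital insertion is the first step of a two-year mapping mission.",
]

X = TfidfVectorizer().fit_transform(sentences).toarray()
pca = PCA(n_components=2).fit(X)
scores = np.abs(pca.transform(X)).sum(axis=1)   # weight of each sentence on top components

top = np.argsort(scores)[::-1][:2]              # 2-sentence summary, kept in original order
for i in sorted(top):
    print(sentences[i])
```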
{"title":"Estudio de la influencia de incorporar conocimiento léxico-semántico a la técnica de Análisis de Componentes Principales para la generación de resúmenes multilingües","authors":"Óscar Alcón, E. Lloret","doi":"10.21814/LM.7.1.205","DOIUrl":"https://doi.org/10.21814/LM.7.1.205","url":null,"abstract":"The objective of automatic text summarization is to reduce the dimension of a text keeping the relevant information. In this paper we analyse and apply the language-independent Principal Component Analysis technique for generating extractive single-document multilingual summaries. This technique will be studied to evaluate its performance with and without adding lexical-semantic knowledge through language-dependent resources and tools. Experiments were conducted using two different corpora: newswire and Wikipedia articles in three languages (English, German and Spanish) to validate the use of this technique in several scenarios. The proposed approaches show very competitive results compared to multilingual available systems, indicating that, although there is still room for improvement with respect to the technique and the type of knowledge to be taken into consideration, this has great potential for being applied in other contexts and for other languages.","PeriodicalId":41819,"journal":{"name":"Linguamatica","volume":"7 1","pages":"53-63"},"PeriodicalIF":0.6,"publicationDate":"2015-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68371618","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}