"Hacia una clasificación verbal automática para el español: estudio sobre la relevancia de los diferentes tipos y configuraciones de información sintáctico-semántica"
Lara Gil-Vallejo, I. Castellón, Marta Coll-Florit, J. Turmo. Linguamatica 7(1): 41-52, 2015-07-31. DOI: 10.21814/LM.7.1.202
In this work we focus on the acquisition of automatic verb classifications for Spanish. To that end, we run a series of experiments on 20 verb senses from the Sensem corpus. We use different types of features, covering diverse linguistic information, together with an agglomerative hierarchical clustering method to generate several classifications. We compare each of these automatic classifications against a gold standard built semi-automatically from linguistic constructions proposed in theoretical linguistics. This comparison shows which features are best suited to automatically producing a classification consistent with construction theory, and where the automatic verb classification agrees with and differs from one based on the theory of linguistic constructions.
"Geração de Linguagem Natural para Conversão de Dados em Texto - Aplicação a um Assistente de Medicação para o Português"
J. C. Pereira, A. Teixeira. Linguamatica 7(1): 3-21, 2015-07-31. DOI: 10.21814/LM.7.1.206
New devices, such as smartphones and tablets, are changing human-computer interaction, and they pose several challenges, especially due to their small screens and keyboards. To use text and voice in multimodal interaction, it is essential to deploy modules that translate the internal information of an application into sentences or texts, so that it can be displayed on screen or synthesized. These modules must also generate phrases and texts in the user's native language; their development should not require considerable resources; and the generated output should achieve a good degree of variability. Our main objective is to propose, implement and evaluate a data-to-text method for Portuguese that can be developed with a minimum of time and knowledge, without compromising the necessary variability and quality of what is generated. The developed system, built for a Medication Assistant, creates natural-language descriptions of the medication to be taken. Motivated by recent results, we opted for an approach based on machine translation, with models trained on a small parallel corpus created for this purpose. Two variants of the system were trained: phrase-based translation and syntax-based translation. Both were evaluated with automatic measures (BLEU and Meteor) and by humans. The phrase-based approach produced better results than the syntax-based one: human evaluators rated 60% of the phrase-based responses as good or very good, against only 46% of the syntax-based responses. Considering the corpus size, we judge this value (60%) as good.
"A arquitetura de um glossário terminológico Inglês-Português na área de Eletrotécnica"
S. Fadanelli, M. J. B. Finatto. Linguamatica 7(1): 67-71, 2015-07-31. DOI: 10.21814/LM.7.1.204
This article describes some of the procedures for building an online English-Portuguese glossary prototype for Electrical Engineering / Electrotechnical terminology, aimed mainly at beginner students in technical and undergraduate Electrical Engineering courses. The methodology combines a corpus of datasheets, documents often used by professionals in the Electrical Engineering area, with a comparison of the data obtained from these datasheets against data gathered from 108 students of Electrical courses. The results point to the relevance of considering the point of view of the target audience when building the glossary.
"Uma Comparação Sistemática de Diferentes Abordagens para a Sumarização Automática Extrativa de Textos em Português"
M. Costa, Bruno Martins. Linguamatica 7(1): 23-40, 2015-07-31. DOI: 10.21814/LM.7.1.203
Automatic document summarization is the task of automatically generating condensed versions of source texts, and it is one of the fundamental problems in Information Retrieval and Natural Language Processing. In this paper, different extractive approaches are compared on the task of summarizing individual journalistic texts written in Portuguese. Using the ROUGE package to measure the quality of the produced summaries, we report results for two experimental domains: (i) generating headlines for news articles written in European Portuguese, and (ii) generating summaries for news articles written in Brazilian Portuguese. The results show that methods based on selecting the first sentences perform best on extractive headline generation across several ROUGE metrics. For summaries longer than one sentence, the LSA Squared algorithm achieved the best results across the various ROUGE metrics.
"Extração de Relações utilizando Features Diferenciadas para Português"
Erick Nilsen Pereira de Souza, Daniela Barreio Claro. Linguamatica 6(2): 57-65, 2014-12-26. DOI: 10.21814/LM.6.2.182
Relation Extraction (RE) is the Information Extraction (IE) task responsible for discovering semantic relationships between concepts in unstructured text. When the extraction is not limited to a predefined set of relations, the task is called Open Relation Extraction, whose main challenge is to reduce the proportion of invalid extractions among the relationships identified. Current methods based on a set of specific machine learning features eliminate much of the invalid extractions, but have the disadvantage of being highly language-dependent. This dependence arises from the difficulty of finding the most representative feature set for the Open RE problem, given the peculiarities of each language. In this context, the present work assesses the difficulties of feature-based classification for open relation extraction in Portuguese, aiming to ground new solutions that can reduce language dependence in this task. The results indicate that many features that are representative for English cannot be mapped directly to Portuguese with satisfactory classification performance. Among the classification algorithms evaluated, J48 showed the best results, with an F-measure of 84.1%, followed by SVM (83.9%), Perceptron (82.0%) and Naive Bayes (79.9%).
"Izen+aditz konbinazioen azterketa elebiduna, hizkuntza-aplikazio aurreratuei begira"
Uxoa Iñurrieta Urmeneta, I. Aduriz, A. D. D. Ilarraza, Gorka Labaka, K. Sarasola. Linguamatica 6(2): 45-55, 2014-12-26. DOI: 10.21814/LM.6.2.188
This article deals with noun+verb combinations in bilingual Basque-Spanish and Spanish-Basque dictionaries. We take a look at morphosyntactic and semantic features of word combinations in both language directions, and compare them to identify differences and similarities. Our work reveals the high complexity of those constructions and, hence, the need to address them specifically in Natural Language Processing tools, for example in Machine Translation. All of our results are publicly available online, where users can query the combinations we have analysed.
"O dicionario de sinónimos como recurso para a expansión de WordNet"
Xavier Gómez Guinovart, Miguel Anxo Solla Portela. Linguamatica 6(2): 69-74, 2014-12-26. DOI: 10.21814/LM.6.2.183
In this paper, we present the foundations of a lexical acquisition experiment designed within the framework of the SKATeR research project and aimed at expanding the Galician WordNet using the lexicographical data collected in a "traditional" Galician dictionary of synonyms.
"Projetos sobre Tradução Automática do Português no Laboratório de Sistemas de Língua Falada do INESC-ID"
Anabela Barreiro, Wang Ling, Luísa Coheur, Fernando Batista, Isabel Trancoso. Linguamatica 6(2): 75-85, 2014-12-26. DOI: 10.21814/LM.6.2.196
Language technologies, in particular machine translation applications, have the potential to help break down linguistic and cultural barriers, making an important contribution to the globalization and internationalization of the Portuguese language by allowing content to be shared 'from' and 'to' this language. This article presents the research work developed at the Spoken Language Systems Laboratory of INESC-ID in the field of machine translation, namely automated speech translation, microblog translation, and the creation of a hybrid machine translation system. We focus on the hybrid system, which aims to combine linguistic knowledge, in particular semantico-syntactic knowledge, with statistical knowledge to increase translation quality.
"Euskarazko denbora-egiturak. Azterketa eta etiketatze-esperimentua"
Begoña Altuna, M.ª Jesús Aranzabe, A. D. D. Ilarraza. Linguamatica 6(2): 13-24, 2014-12-25. DOI: 10.21814/LM.6.2.184
Temporal information extraction is very useful in natural language processing (NLP), as it can be used in text simplification, information extraction and machine translation systems. In this paper we present the first steps towards making that information accessible for the Basque language: on the one hand, Basque structures that convey time have been analysed on the basis of grammars; on the other hand, the first decisions on tagging them in real texts have been taken. We also report on an annotation experiment carried out on a corpus of financial news.
"Avaliação de métodos de desofuscação de palavrões"
Gustavo Laboreiro, E. Oliveira. Linguamatica 6(2): 25-43, 2014-12-25. DOI: 10.21814/LM.6.2.191
Cursing is a form of expression notable for its intensity. Someone who uses it is emitting a spontaneous and raw form of opinion, one usually suppressed in "milder" registers and by sensitive people. This sort of expression is also valuable when performing opinion mining and sentiment analysis, now routine tasks across social networks. In this work we therefore evaluate methods that allow the recovery of these forms of expression, which are disguised through obfuscation, often as a way to escape automatic censorship.