Representation of structured data of the text genre as a technique for automatic text processing

Texto Livre-Linguagem e Tecnologia | LANGUAGE & LINGUISTICS | IF 0.7 | Pub Date: 2022-01-27 | DOI: 10.35699/1983-3652.2022.35445

Claudia Aparecida Fonseca, M. V. C. Guelpeli, Rafael Santiago de Souza Netto

Citations: 1

Representation of structured data of the text genre as a technique for automatic text processing

This article was developed in the fields of Natural Language Processing and Language Studies and is based on a corpus compiled with computational tools. It starts from the assumption that it is helpful to trace a close relationship between corpus generation/annotation and the assessment of the constitutive elements of the source text genre. It aims to demonstrate, through specific studies of structured data from the text genre 'scientific article', alternatives to automatic text processing techniques. To that end, the authors created a computational model for compiling CorpACE, a specialized linguistic corpus representative of the scientific article genre. The object of study comprises the constitutive elements of scientific articles, marked up in XML, extracted and collected from the SciELO (Scientific Electronic Library Online) database. The final product is a database of information extracted and structured in XML format, which designates and identifies the markup of the genre under analysis and is available to many tools and applications. The results demonstrate how representing the constitutive elements of the genre can condense the available information through hierarchical and dynamic processes built during compilation. The study concludes that more research is needed to bring Language Science and Computer Science closer together, with emphasis on NLP, in the attempt to represent and manipulate linguistic knowledge at its many levels (morphological, syntactic, semantic and discursive) in order to improve the implementation and manipulation of automatic text processing.
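As a rough illustration of what "constitutive elements marked in XML" could look like in practice, the following is a minimal sketch in Python. The abstract does not give the CorpACE schema, so the tag names (`article`, `title`, `author`, `abstract`, `keyword`, `section`), the sample document, and the `extract_elements` helper are all assumptions made for this example, not the authors' model.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: tag names are illustrative, not the CorpACE schema.
SAMPLE = """<article id="a1">
  <title>Sample title</title>
  <authors><author>A. Author</author><author>B. Author</author></authors>
  <abstract>Short abstract text.</abstract>
  <keywords><keyword>corpus</keyword><keyword>XML</keyword></keywords>
  <section name="Introduction">Intro text.</section>
  <section name="Conclusion">Closing text.</section>
</article>"""


def extract_elements(xml_text):
    """Collect the marked-up elements of one article into a plain dict."""
    root = ET.fromstring(xml_text)
    return {
        "id": root.get("id"),
        "title": root.findtext("title", default="").strip(),
        "authors": [a.text.strip() for a in root.iter("author") if a.text],
        "abstract": root.findtext("abstract", default="").strip(),
        "keywords": [k.text.strip() for k in root.iter("keyword") if k.text],
        # Section name -> section text, in document order.
        "sections": {
            s.get("name"): (s.text or "").strip() for s in root.iter("section")
        },
    }


record = extract_elements(SAMPLE)
print(record["title"])
print(record["authors"])
print(list(record["sections"]))
```

Keeping each element under its own key is one way to make the hierarchical structure of the genre directly queryable by downstream tools; a real corpus would need to follow whatever schema the source database actually uses.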

Source journal

Texto Livre-Linguagem e Tecnologia (LANGUAGE & LINGUISTICS)

CiteScore: 1.10

Self-citation rate: 16.70%

Articles published: not listed

Review time: 5 weeks

About the journal: Texto Livre: Linguagem e Tecnologia is a quarterly journal, sponsored by the School of Letters of the Federal University of Minas Gerais (Brazil) since 2008. It welcomes submissions of articles, reviews, essays and translations on the relationship between languages and digital media. Its mission is to promote scientific production in the field of language studies, especially analysis of writing and practices for teaching writing through free and open new technologies, and studies on documentation and dissemination of free and open software, providing researchers from Brazil and abroad with the opportunity to share their research and contribute to the debate and scientific progress in the area. Topics of interest to this journal include: intertextuality, usability, computer use in the classroom, free culture, digital inclusion, digital literacy, dissemination of free software and other topics related to language and technology. The journal accepts manuscripts in Portuguese, Spanish, English and French, with no need for a translation into Portuguese. Texto Livre is intended for researchers and for a non-academic audience interested in critical approaches to the related topics addressed by the journal.