An investigation of linguistic problems in automatic multi-document summaries / Uma investigação de problemas linguísticos em sumários automáticos multidocumento

Revista de Estudos da Linguagem (Language & Linguistics), vol. 29, pp. 859-907. IF: 0.1. Pub Date: 2021-03-19. DOI: 10.17851/2237-2083.29.2.859-907

Márcio de Souza Dias, Ariani Di Felippo, A. Rassi, P. Cardoso, F. Nóbrega, T. Pardo



Abstract

Automatic summaries commonly present diverse linguistic problems that affect textual quality and thus how well users understand them. Few studies have tried to characterize such problems and their relation to the performance of summarization systems. In this paper, we investigated the problems in multi-document extracts (i.e., summaries produced by concatenating several sentences taken exactly as they appear in the source texts) generated by systems for Brazilian Portuguese that have different approaches (i.e., superficial and deep) and performances (i.e., baseline and state-of-the-art methods). To do so, we first reviewed the main characterization studies, resulting in a typology of linguistic problems more suitable for multi-document summarization. Then, we manually annotated a corpus of automatic multi-document extracts in Portuguese based on the typology, which showed that some linguistic problems are significantly more recurrent than others. Thus, this corpus annotation may support research on the detection and correction of linguistic problems for summary improvement, allowing the production of automatic summaries that are not only informative (i.e., they convey the content of the source material), but also linguistically well structured.

Keywords: automatic summarization; multi-document summary; linguistic problem; corpus annotation.

Resumo: Sumários automáticos geralmente apresentam vários problemas linguísticos que afetam a sua qualidade textual e, consequentemente, sua compreensão pelos usuários. Alguns trabalhos caracterizam tais problemas e os relacionam ao desempenho dos sistemas de sumarização. Neste artigo, investigaram-se os problemas em extratos (isto é, sumários produzidos pela concatenação de sentenças extraídas na íntegra dos textos-fonte) multidocumento em Português do Brasil gerados por sistemas que apresentam diferentes abordagens (isto é, superficial e profunda) e desempenho (isto é, métodos baseline e do estado-da-arte). Para tanto, as principais caracterizações dos problemas linguísticos em sumários automáticos foram investigadas, resultando em uma tipologia mais adequada à sumarização multidocumento. Em seguida, anotou-se manualmente um corpus de extratos com base na tipologia, evidenciando que alguns tipos de problemas são significativamente mais recorrentes que outros. Assim, essa anotação gera subsídios para as tarefas automáticas de detecção e correção de problemas linguísticos com vistas à produção de sumários automáticos não só mais informativos (isto é, que cobrem o conteúdo do material de origem), como também linguisticamente bem-estruturados.

Palavras-chave: sumarização automática; sumário multidocumento; problema linguístico; anotação de corpus.


