Language independent text summarization of western European languages using shape coding of text elements

2017 13th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD). Pub Date: 2017-07-29. DOI: 10.1109/FSKD.2017.8393116

A. Saleh, L. Weigang


Citations: 2

Abstract

The majority of text summarization techniques in the literature depend, in one way or another, on language-dependent pre-structured lexicons, databases, taggers and/or parsers. Such techniques require prior knowledge of the language of the text being summarized. In this paper we propose an extractive text summarization tool, the UnB Language Independent Text Summarizer (UnB-LITS), which performs text summarization in a language-independent manner. The new model depends on intrinsic characteristics of the text being summarized rather than on its language, and thus eliminates the need for language-dependent lexicons, databases, taggers or parsers. Within this tool, we develop an innovative way of coding the shapes of text elements (words, n-grams, sentences and paragraphs), and we propose language-independent algorithms that are capable of normalizing words and performing relative stemming or lemmatization. The proposed algorithms and the Shape-Coding routine enable the UnB-LITS tool to extract intrinsic features of document elements and score them statistically to produce a representative extractive summary independent of the document language. In this paper we focus on single-document summarization of western European languages. The tool was tested on hundreds of documents written in English, Portuguese, French and Spanish and performed better than results reported in the literature and than commercial summarizers.
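The abstract does not specify the Shape-Coding routine or the scoring formula, so the following is only a minimal sketch of the general idea of lexicon-free extractive summarization, not the UnB-LITS method. It assumes sentences can be split on terminal punctuation, uses a fixed-length prefix as a hypothetical stand-in for the paper's relative stemming, and scores each sentence by the average document frequency of its stems. All function names and parameters here are illustrative assumptions.

```python
import re
from collections import Counter


def split_sentences(text):
    """Split on sentence-final punctuation followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Return lowercase runs of letters (Unicode-aware, so accents survive)."""
    return re.findall(r"[^\W\d_]+", sentence.lower())


def crude_stem(word, prefix_len=5):
    """Hypothetical stand-in for stemming: keep a fixed-length prefix."""
    return word[:prefix_len]


def summarize(text, max_sentences=2, prefix_len=5):
    """Pick the highest-scoring sentences and return them in document order."""
    sentences = split_sentences(text)
    stems_per_sentence = [
        [crude_stem(w, prefix_len) for w in tokenize(s)] for s in sentences
    ]
    # Document-wide stem frequencies; no lexicon or stop-word list is used.
    freq = Counter(stem for stems in stems_per_sentence for stem in stems)

    def score(stems):
        # Average document frequency of the stems in one sentence.
        return sum(freq[s] for s in stems) / len(stems) if stems else 0.0

    ranked = sorted(
        range(len(sentences)),
        key=lambda i: score(stems_per_sentence[i]),
        reverse=True,
    )
    chosen = sorted(ranked[:max_sentences])  # restore document order
    return [sentences[i] for i in chosen]
```

Because nothing in the sketch consults a dictionary, tagger or parser, the same call works unchanged on English, Portuguese, French or Spanish input; the trade-off is that a fixed prefix is a much cruder normalizer than the stemming the paper describes.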


