Unsupervised Extractive Summarization with BERT

A. Dutulescu, M. Dascalu, Stefan Ruseti
{"title":"Unsupervised Extractive Summarization with BERT","authors":"A. Dutulescu, M. Dascalu, Stefan Ruseti","doi":"10.1109/SYNASC57785.2022.00032","DOIUrl":null,"url":null,"abstract":"The task of document summarization became more pressing as the information volume increased exponentially from news websites to scientific writings. As such, the necessity for tools that automatically summarize written text, while keeping its meaning and extracting relevant information, increased. Extractive summarization is an NLP task that targets the identification of relevant sentences from a document and the creation of a summary with those phrases. While extensive research and large datasets are available in English, Romanian and other low-resource languages lack such methods and corpora. In this work, we introduce a new approach for summarization using a Masked Language Model for assessing sentence importance, and we research several baselines for the Romanian language including K-Means with BERT embeddings, an MLP considering handcrafted features, and PacSum. In addition, we also present an evaluation corpus to be used for assessing current and future models. The unsupervised methods do not require large datasets for training and make use of low computational power. All of the proposed approaches consider BERT, a state-of-the-art Transformer used for generating contextualized embeddings. The obtained ROUGE score of 56.29 is comparable with state-of-the-art scores and the METEOR average of 51.20 supersedes the most advanced current model.","PeriodicalId":446065,"journal":{"name":"2022 24th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 24th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SYNASC57785.2022.00032","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

The task of document summarization has become more pressing as the volume of information, from news websites to scientific writing, has grown exponentially. Consequently, the need for tools that automatically summarize written text while preserving its meaning and extracting the relevant information has increased. Extractive summarization is an NLP task that identifies the most relevant sentences in a document and assembles them into a summary. While extensive research and large datasets are available for English, Romanian and other low-resource languages lack such methods and corpora. In this work, we introduce a new approach to summarization that uses a Masked Language Model to assess sentence importance, and we investigate several baselines for Romanian, including K-Means over BERT embeddings, an MLP trained on handcrafted features, and PacSum. In addition, we present an evaluation corpus for assessing current and future models. The unsupervised methods do not require large training datasets and have low computational demands. All of the proposed approaches rely on BERT, a state-of-the-art Transformer used to generate contextualized embeddings. The obtained ROUGE score of 56.29 is comparable with state-of-the-art results, and the METEOR average of 51.20 surpasses the most advanced current model.
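The abstract names its techniques without implementation detail, so the two sketches below are only illustrative. First, one plausible formulation of Masked-Language-Model sentence scoring (an assumption on our part, not taken from the paper): each token of a candidate sentence is masked in turn within the full document, and the mean masked-token log-probability serves as an importance signal, on the premise that sentences the MLM cannot reconstruct from the surrounding context carry more novel information. The checkpoint name is a multilingual stand-in, since the paper targets Romanian.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL_NAME = "bert-base-multilingual-cased"  # stand-in; a Romanian BERT fits the paper's setting
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME).eval()


@torch.no_grad()
def sentence_importance(sentences: list[str], idx: int) -> float:
    """Negative mean log-probability of sentence `idx`'s tokens, each masked
    in turn inside the full document context (assumed scoring; see above)."""
    doc = " ".join(sentences)
    start = len(" ".join(sentences[:idx])) + (1 if idx > 0 else 0)
    end = start + len(sentences[idx])
    enc = tokenizer(doc, return_tensors="pt", truncation=True,
                    return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0].tolist()
    input_ids = enc["input_ids"]
    total, count = 0.0, 0
    for pos, (s, e) in enumerate(offsets):
        if s == e or e <= start or s >= end:
            continue  # special token, or token outside the target sentence
        masked = input_ids.clone()
        original_id = masked[0, pos].item()
        masked[0, pos] = tokenizer.mask_token_id
        logits = model(input_ids=masked,
                       attention_mask=enc["attention_mask"]).logits
        total += torch.log_softmax(logits[0, pos], dim=-1)[original_id].item()
        count += 1
    return -total / max(count, 1)  # higher = harder to predict from context


sentences = [
    "Extractive summarization selects the most informative sentences.",
    "It is studied far less for Romanian than for English.",
    "The weather was pleasant that day.",
]
ranked = sorted(range(len(sentences)),
                key=lambda i: sentence_importance(sentences, i), reverse=True)
print([sentences[i] for i in ranked[:2]])  # top-2 sentences form the summary
```

The K-Means baseline can likewise be sketched under common assumptions (mean-pooled BERT sentence vectors and one representative sentence per centroid); the abstract does not state the pooling or the selection rule.

```python
import numpy as np
import torch
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-multilingual-cased"  # stand-in for a Romanian BERT
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME).eval()


@torch.no_grad()
def embed(sentences: list[str]) -> np.ndarray:
    """Mean-pooled BERT token embeddings, ignoring padding positions."""
    enc = tokenizer(sentences, padding=True, truncation=True,
                    return_tensors="pt")
    hidden = model(**enc).last_hidden_state            # (batch, seq, dim)
    mask = enc["attention_mask"].unsqueeze(-1).float()
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()


def kmeans_summary(sentences: list[str], k: int = 3) -> list[str]:
    """Cluster sentence embeddings; keep the sentence nearest each centroid."""
    vecs = embed(sentences)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(vecs)
    chosen = {int(np.argmin(np.linalg.norm(vecs - c, axis=1)))
              for c in km.cluster_centers_}
    return [sentences[i] for i in sorted(chosen)]      # restore document order
```

Both sketches run on CPU with modest memory, consistent with the abstract's claim that the unsupervised methods require little computational power.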