{"title":"Unsupervised Extractive Summarization with BERT","authors":"A. Dutulescu, M. Dascalu, Stefan Ruseti","doi":"10.1109/SYNASC57785.2022.00032","DOIUrl":null,"url":null,"abstract":"The task of document summarization became more pressing as the information volume increased exponentially from news websites to scientific writings. As such, the necessity for tools that automatically summarize written text, while keeping its meaning and extracting relevant information, increased. Extractive summarization is an NLP task that targets the identification of relevant sentences from a document and the creation of a summary with those phrases. While extensive research and large datasets are available in English, Romanian and other low-resource languages lack such methods and corpora. In this work, we introduce a new approach for summarization using a Masked Language Model for assessing sentence importance, and we research several baselines for the Romanian language including K-Means with BERT embeddings, an MLP considering handcrafted features, and PacSum. In addition, we also present an evaluation corpus to be used for assessing current and future models. The unsupervised methods do not require large datasets for training and make use of low computational power. All of the proposed approaches consider BERT, a state-of-the-art Transformer used for generating contextualized embeddings. The obtained ROUGE score of 56.29 is comparable with state-of-the-art scores and the METEOR average of 51.20 supersedes the most advanced current model.","PeriodicalId":446065,"journal":{"name":"2022 24th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 24th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SYNASC57785.2022.00032","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
The task of document summarization has become more pressing as the volume of information, from news websites to scientific writing, has grown exponentially. The need for tools that automatically summarize written text, while preserving its meaning and extracting relevant information, has grown accordingly. Extractive summarization is an NLP task that targets the identification of relevant sentences in a document and the assembly of a summary from those sentences. While extensive research and large datasets are available for English, Romanian and other low-resource languages lack such methods and corpora. In this work, we introduce a new approach to summarization that uses a Masked Language Model to assess sentence importance, and we investigate several baselines for Romanian, including K-Means over BERT embeddings, an MLP trained on handcrafted features, and PacSum. In addition, we present an evaluation corpus for assessing current and future models. The unsupervised methods do not require large training datasets and demand little computational power. All of the proposed approaches build on BERT, a state-of-the-art Transformer used to generate contextualized embeddings. The obtained ROUGE score of 56.29 is comparable with state-of-the-art results, and the METEOR average of 51.20 surpasses the most advanced current model.
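For illustration, the following is a minimal sketch of the K-Means baseline named in the abstract: embed each sentence with a BERT model, cluster the embeddings, and pick the sentence closest to each centroid as a summary sentence. The model name, mean pooling, and cluster count are assumptions made for this example, not the authors' exact configuration.

```python
# Sketch of a K-Means extractive baseline over BERT sentence embeddings.
# Assumptions: bert-base-multilingual-cased as the encoder (the paper targets
# Romanian), mean pooling over token states, one summary sentence per cluster.
import numpy as np
import torch
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-multilingual-cased"  # illustrative choice
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed_sentences(sentences):
    """Return one mean-pooled BERT embedding per sentence."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state   # (n, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1)    # (n, seq_len, 1)
    summed = (hidden * mask).sum(dim=1)             # ignore padding tokens
    counts = mask.sum(dim=1).clamp(min=1)
    return (summed / counts).numpy()

def kmeans_summary(sentences, n_sentences=3):
    """Select the sentence nearest to each K-Means centroid, in document order."""
    embeddings = embed_sentences(sentences)
    km = KMeans(n_clusters=n_sentences, n_init=10, random_state=0).fit(embeddings)
    chosen = set()
    for centroid in km.cluster_centers_:
        dists = np.linalg.norm(embeddings - centroid, axis=1)
        chosen.add(int(dists.argmin()))
    return [sentences[i] for i in sorted(chosen)]
```

Selected sentences are returned in their original document order, which is the usual convention for extractive summaries; the cluster count effectively controls the summary length.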