Van Zachary V. Singco, Joel C. Trillo, Cristopher C. Abalorio, James Cloyd M. Bustillo, Junell T. Bojocan, Michelle C. Elape
International Journal of Emerging Technology and Advanced Engineering. Published 2023-02-04. DOI: 10.46338/ijetae0223_07 (https://doi.org/10.46338/ijetae0223_07)
OCR-based Hybrid Image Text Summarizer using Luhn Algorithm with Finetune Transformer Models for Long Document
The vast number of image-based text documents available on the internet creates an opportunity for systems that combine image text recognition with text summarization. Most automatic text summarization (ATS) approaches in the literature are either extractive or abstractive; few implementations of a hybrid approach have been reported. This paper combines state-of-the-art transformer models with the Luhn algorithm, applied to text extracted with Tesseract OCR. Nine models were generated and tested using the hybrid text summarization approach. Using ROUGE metrics, we compared the proposed system's fine-tuned abstractive models against existing abstractive models that use the same XSum dataset. The fine-tuned model achieved the highest ROUGE scores in the evaluation: 57% for ROUGE-1, 43% for ROUGE-2, and 42% for ROUGE-L. Furthermore, although better algorithms and models are available for summarization, the combination of the Luhn algorithm and the fine-tuned T5 model produced significant results.
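To make the extractive half of the hybrid approach concrete, the following is a minimal sketch of Luhn-style sentence scoring in Python. It is not the authors' implementation: the abstract does not give their settings, so the stop-word list, the frequency threshold `min_count`, the gap limit `max_gap`, and all function names are assumptions chosen for illustration. The OCR and abstractive steps are left out; in the pipeline described above, the input text would come from Tesseract OCR and the selected sentences would be passed on to a fine-tuned transformer model.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real systems use larger lists).
STOP_WORDS = {
    "the", "a", "an", "of", "and", "to", "in", "is", "are", "for", "on",
    "with", "that", "this", "it", "as", "was", "were", "by", "be", "from",
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase the sentence and keep alphanumeric runs as words."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def significant_words(sentences, min_count=2):
    """Non-stop words occurring at least min_count times in the document."""
    counts = Counter(
        w for s in sentences for w in tokenize(s) if w not in STOP_WORDS
    )
    return {w for w, c in counts.items() if c >= min_count}


def luhn_score(sentence, significant, max_gap=4):
    """Best cluster score: (significant words in cluster)^2 / cluster length.

    A cluster starts and ends with a significant word and allows at most
    max_gap insignificant words between consecutive significant words.
    """
    words = tokenize(sentence)
    positions = [i for i, w in enumerate(words) if w in significant]
    if not positions:
        return 0.0
    best = 0.0
    start = prev = positions[0]
    count = 1
    for pos in positions[1:]:
        if pos - prev - 1 <= max_gap:
            count += 1  # still inside the current cluster
        else:
            # Close the current cluster and start a new one at pos.
            best = max(best, count * count / (prev - start + 1))
            start = pos
            count = 1
        prev = pos
    return max(best, count * count / (prev - start + 1))


def luhn_summarize(text, num_sentences=2, min_count=2, max_gap=4):
    """Return the top-scoring sentences in their original document order."""
    sentences = split_sentences(text)
    significant = significant_words(sentences, min_count)
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: luhn_score(sentences[i], significant, max_gap),
        reverse=True,
    )
    return [sentences[i] for i in sorted(ranked[:num_sentences])]
```

Calling `luhn_summarize(ocr_text, num_sentences=5)` on a long OCR output (where `ocr_text` is a hypothetical variable holding the recognized text) would shorten the document before abstractive summarization, which is one plausible way to keep the input within a transformer model's length limit.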