Development of a Gold-standard Pashto Dataset and a Segmentation App
Authors: Yan Han, Marek Rychlik
Journal: Information Technology and Libraries
Published: 2021-03-11
DOI: https://doi.org/10.6017/ITAL.V40I1.12553
This article introduces a gold-standard Pashto dataset and a segmentation app. The dataset consists of 300 line images and their corresponding Pashto text, drawn from three selected books. A line image is an image containing a single line of text from a scanned page. To our knowledge, this is one of the first open access datasets that directly map line images to their corresponding text in the Pashto language. We also describe the development of a segmentation app based on textbox-expanding algorithms, a different approach to OCR segmentation.
About the journal:
Information Technology and Libraries publishes original material related to all aspects of information technology in all types of libraries. Topic areas include, but are not limited to, library automation, digital libraries, metadata, identity management, distributed systems and networks, computer security, intellectual property rights, technical standards, geographic information systems, desktop applications, information discovery tools, web-scale library services, cloud computing, digital preservation, data curation, virtualization, search-engine optimization, emerging technologies, social networking, open data, the semantic web, mobile services and applications, usability, universal access to technology, library consortia, vendor relations, and digital humanities.