将文档转换作为具有高吞吐量和响应能力的云服务交付

Q1 Computer Science IEEE Cloud Computing Pub Date : 2022-06-01 DOI:10.1109/CLOUD55607.2022.00060
Christoph Auer, Michele Dolfi, A. Carvalho, Cesar Berrospi Ramis, P. W. J. S. I. Research, SoftINSA Lda.
{"title":"将文档转换作为具有高吞吐量和响应能力的云服务交付","authors":"Christoph Auer, Michele Dolfi, A. Carvalho, Cesar Berrospi Ramis, P. W. J. S. I. Research, SoftINSA Lda.","doi":"10.1109/CLOUD55607.2022.00060","DOIUrl":null,"url":null,"abstract":"Document understanding is a key business process in the data-driven economy since documents are central to knowledge discovery and business insights. Converting documents into a machine-processable format is a particular challenge here due to their huge variability in formats and complex structure. Accordingly, many algorithms and machine-learning methods emerged to solve particular tasks such as Optical Character Recognition (OCR), layout analysis, table-structure recovery, figure understanding, etc. We observe the adoption of such methods in document understanding solutions offered by all major cloud providers. Yet, publications outlining how such services are designed and optimized to scale in the cloud are scarce. In this paper, we focus on the case of document conversion to illustrate the particular challenges of scaling a complex data processing pipeline with a strong reliance on machine-learning methods on cloud infrastructure. Our key objective is to achieve high scalability and responsiveness for different workload profiles in a well-defined resource budget. We outline the requirements, design, and implementation choices of our document conversion service and reflect on the challenges we faced. Evidence for the scaling behavior and resource efficiency is provided for two alternative workload distribution strategies and deployment configurations. Our best-performing method achieves sustained throughput of over one million PDF pages per hour on 3072 CPU cores across 192 nodes.","PeriodicalId":54281,"journal":{"name":"IEEE Cloud Computing","volume":"110 1","pages":"363-373"},"PeriodicalIF":0.0000,"publicationDate":"2022-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Delivering Document Conversion as a Cloud Service with High Throughput and Responsiveness\",\"authors\":\"Christoph Auer, Michele Dolfi, A. Carvalho, Cesar Berrospi Ramis, P. W. J. S. I. Research, SoftINSA Lda.\",\"doi\":\"10.1109/CLOUD55607.2022.00060\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Document understanding is a key business process in the data-driven economy since documents are central to knowledge discovery and business insights. Converting documents into a machine-processable format is a particular challenge here due to their huge variability in formats and complex structure. Accordingly, many algorithms and machine-learning methods emerged to solve particular tasks such as Optical Character Recognition (OCR), layout analysis, table-structure recovery, figure understanding, etc. We observe the adoption of such methods in document understanding solutions offered by all major cloud providers. Yet, publications outlining how such services are designed and optimized to scale in the cloud are scarce. In this paper, we focus on the case of document conversion to illustrate the particular challenges of scaling a complex data processing pipeline with a strong reliance on machine-learning methods on cloud infrastructure. Our key objective is to achieve high scalability and responsiveness for different workload profiles in a well-defined resource budget. We outline the requirements, design, and implementation choices of our document conversion service and reflect on the challenges we faced. Evidence for the scaling behavior and resource efficiency is provided for two alternative workload distribution strategies and deployment configurations. Our best-performing method achieves sustained throughput of over one million PDF pages per hour on 3072 CPU cores across 192 nodes.\",\"PeriodicalId\":54281,\"journal\":{\"name\":\"IEEE Cloud Computing\",\"volume\":\"110 1\",\"pages\":\"363-373\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-06-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Cloud Computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CLOUD55607.2022.00060\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"Computer Science\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Cloud Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CLOUD55607.2022.00060","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"Computer Science","Score":null,"Total":0}
引用次数: 2

摘要

文档理解是数据驱动经济中的一个关键业务流程,因为文档是知识发现和业务洞察的核心。将文档转换为机器可处理的格式在这里是一个特别的挑战,因为它们在格式和复杂的结构上有很大的可变性。因此,出现了许多算法和机器学习方法来解决特定的任务,如光学字符识别(OCR)、布局分析、表结构恢复、图形理解等。我们观察到所有主要云提供商提供的文档理解解决方案都采用了这些方法。然而,概述如何设计和优化这些服务以在云中扩展的出版物很少。在本文中,我们将重点关注文档转换的情况,以说明扩展复杂数据处理管道的特殊挑战,该管道强烈依赖于云基础设施上的机器学习方法。我们的主要目标是在定义良好的资源预算中实现不同工作负载配置文件的高可伸缩性和响应性。我们概述了文档转换服务的需求、设计和实现选择,并反映了我们面临的挑战。本文为两种可选的工作负载分布策略和部署配置提供了扩展行为和资源效率的证据。我们性能最好的方法在192个节点上的3072个CPU内核上实现了每小时超过100万PDF页面的持续吞吐量。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Delivering Document Conversion as a Cloud Service with High Throughput and Responsiveness
Document understanding is a key business process in the data-driven economy since documents are central to knowledge discovery and business insights. Converting documents into a machine-processable format is a particular challenge here due to their huge variability in formats and complex structure. Accordingly, many algorithms and machine-learning methods emerged to solve particular tasks such as Optical Character Recognition (OCR), layout analysis, table-structure recovery, figure understanding, etc. We observe the adoption of such methods in document understanding solutions offered by all major cloud providers. Yet, publications outlining how such services are designed and optimized to scale in the cloud are scarce. In this paper, we focus on the case of document conversion to illustrate the particular challenges of scaling a complex data processing pipeline with a strong reliance on machine-learning methods on cloud infrastructure. Our key objective is to achieve high scalability and responsiveness for different workload profiles in a well-defined resource budget. We outline the requirements, design, and implementation choices of our document conversion service and reflect on the challenges we faced. Evidence for the scaling behavior and resource efficiency is provided for two alternative workload distribution strategies and deployment configurations. Our best-performing method achieves sustained throughput of over one million PDF pages per hour on 3072 CPU cores across 192 nodes.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
IEEE Cloud Computing
IEEE Cloud Computing Computer Science-Computer Networks and Communications
CiteScore
11.20
自引率
0.00%
发文量
0
期刊介绍: Cessation. IEEE Cloud Computing is committed to the timely publication of peer-reviewed articles that provide innovative research ideas, applications results, and case studies in all areas of cloud computing. Topics relating to novel theory, algorithms, performance analyses and applications of techniques are covered. More specifically: Cloud software, Cloud security, Trade-offs between privacy and utility of cloud, Cloud in the business environment, Cloud economics, Cloud governance, Migrating to the cloud, Cloud standards, Development tools, Backup and recovery, Interoperability, Applications management, Data analytics, Communications protocols, Mobile cloud, Private clouds, Liability issues for data loss on clouds, Data integration, Big data, Cloud education, Cloud skill sets, Cloud energy consumption, The architecture of cloud computing, Applications in commerce, education, and industry, Infrastructure as a Service (IaaS), Platform as a Service (PaaS), Software as a Service (SaaS), Business Process as a Service (BPaaS)
期刊最新文献
Different in different ways: A network-analysis approach to voice and prosody in Autism Spectrum Disorder. Layered Contention Mitigation for Cloud Storage Towards More Effective and Explainable Fault Management Using Cross-Layer Service Topology Bypass Container Overlay Networks with Transparent BPF-driven Socket Replacement Event-Driven Approach for Monitoring and Orchestration of Cloud and Edge-Enabled IoT Systems
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1