PolyDoc: Surveying PDF Files from the PolySwarm network

Prashant Anantharaman, R. Lathrop, Rebecca Shapiro, M. Locasto
2023 IEEE Security and Privacy Workshops (SPW), published 2023-05-01
DOI: 10.1109/SPW59333.2023.00017 (https://doi.org/10.1109/SPW59333.2023.00017)

Abstract

Complex data formats implicitly demand complex logic to parse and interpret them. The Portable Document Format (PDF) is among the most demanding formats because it serves as both a data exchange and a presentation format, and it has a particularly stringent tradition of supporting interoperability and consistent presentation. These requirements create complexity that gives adversaries an opportunity to encode a variety of exploits and attacks. To investigate whether structural malformations are associated with malice (using PDF files as the example challenge format), we built PolyDoc, a tool that conducts format-aware tracing of files pulled from the PolySwarm network. The PolySwarm network crowdsources threat intelligence by running files through several industry-scale threat-detection engines, and it provides a PolyScore, which indicates whether a file is safe or malicious as judged by those engines. We ran PolyDoc in a live hunt mode to gather PDF files submitted to PolySwarm and then traced these files through popular PDF tools such as Mutool, Poppler, and Caradoc. We collected and analyzed 58,906 files from PolySwarm. Further, we used the PDF Error Ontology to assign error categories based on tracer output and compared them to the PolyScore. Our work demonstrates three core insights. First, PDF files classified as malicious contain syntactic malformations. Second, “uncategorized” error ontology classes were common across our different PDF tools, which suggests that the PDF Error Ontology may be underspecified for the files that real-world threat engines receive. Finally, attackers leverage specific syntactic malformations in attacks, and current PDF tools can detect these malformations.
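The comparison step described in the abstract (assign an error category to each tool message, then compare categories against the PolyScore) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the category names and regular expressions are placeholders rather than the actual PDF Error Ontology classes, the numeric score scale and the 0.7 cutoff are not taken from the paper, and the sample records are made up.

```python
import re
from collections import Counter

# Hypothetical rules: these patterns and category names are illustrative
# placeholders, NOT the actual PDF Error Ontology classes.
RULES = [
    (re.compile(r"xref|cross-reference", re.I), "xref_error"),
    (re.compile(r"stream", re.I), "stream_error"),
    (re.compile(r"syntax|unexpected token", re.I), "syntax_error"),
]

# Assumed cutoff on an assumed numeric score; not taken from the paper.
MALICIOUS_THRESHOLD = 0.7


def categorize(message: str) -> str:
    """Return the first matching category, or 'uncategorized'."""
    for pattern, category in RULES:
        if pattern.search(message):
            return category
    return "uncategorized"


def cross_tabulate(records):
    """Count (verdict, category) pairs over (score, messages) records."""
    table = Counter()
    for score, messages in records:
        verdict = "malicious" if score >= MALICIOUS_THRESHOLD else "benign"
        # Count each category at most once per file.
        for category in {categorize(m) for m in messages}:
            table[(verdict, category)] += 1
    return table


# Made-up sample data for illustration only: (score, tool messages).
records = [
    (0.95, ["error: cannot find xref table", "warning: bad stream length"]),
    (0.10, []),
    (0.85, ["error: something odd happened"]),
]

table = cross_tabulate(records)
for key in sorted(table):
    print(key, table[key])
```

Messages that match none of the rules fall into an "uncategorized" bucket, which mirrors the kind of catch-all class the abstract reports as common across tools. A real pipeline would need rules derived from the ontology itself and messages captured from the actual tools.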