Hakuin: Optimizing Blind SQL Injection with Probabilistic Language Models

Jakub Pruzinec, Quynh Anh Nguyen
Published in: 2023 IEEE Security and Privacy Workshops (SPW), 2023-05-01. DOI: 10.1109/SPW59333.2023.00039 (https://doi.org/10.1109/SPW59333.2023.00039)

Abstract

SQL Injection (SQLI) is a pervasive web attack in which malicious input is used to dynamically build SQL queries in a way that tricks the database (DB) engine into performing unintended, harmful operations. Among many possible exploits, an attacker may opt to exfiltrate the application data. Exfiltration is straightforward when the web application responds to injected queries with their results. When the content is not exposed, the adversary can still deduce it using Blind SQLI (BSQLI), an inference technique based on response differences or time delays. Unfortunately, a common drawback of BSQLI is its low inference rate (one bit per request), which severely limits the volume of data that can be extracted this way. To address this limitation, state-of-the-art BSQLI tools optimize the inference of textual data with binary search. However, this approach has two major limitations: it assumes a uniform distribution of characters, and it does not take into account the history of previously inferred characters. Consequently, the technique is inefficient for the natural-language text that is ubiquitous in DBs. This paper presents Hakuin, a new framework for optimizing BSQLI with probabilistic language models. Hakuin employs domain-specific pre-trained and adaptive models to predict the next character based on the inference history, and it prioritizes the characters that are most likely to be correct. It also tracks statistical information to opportunistically guess whole strings instead of inferring their characters one by one. We benchmark Hakuin against 3 state-of-the-art BSQLI tools using 20 industry-standard DB schemas and a generic DB. The results show that, compared to the second-best performing tool, Hakuin is about 6 times more efficient in inferring schemas, up to 3.2 times more efficient with generic data, and up to 26 times more efficient on columns with limited values. To the best of our knowledge, Hakuin is the first solution that combines domain-specific pre-trained and adaptive language models to optimize BSQLI. We release its full source code, datasets, and language models to facilitate further research.
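The contrast the abstract draws between uniform binary search and probability-guided inference can be illustrated with a small offline simulation. The sketch below is not Hakuin's algorithm or code: the oracle, the alphabet, the training words, and all function names are hypothetical, and the model is a plain unigram frequency table that ignores the inference history the paper's models condition on. It only shows why asking about likely characters first can reduce the number of one-bit requests.

```python
import string
from collections import Counter

# Hypothetical alphabet; every secret character must come from it.
ALPHABET = string.ascii_lowercase + "_"


def make_oracle(secret_char):
    """Simulated blind oracle: each call stands in for one request
    that reveals a single bit (is the character in this set?)."""
    counter = {"requests": 0}

    def oracle(candidates):
        counter["requests"] += 1
        return secret_char in candidates

    return oracle, counter


def uniform_binary_search(oracle, alphabet=ALPHABET):
    """Halve the candidate list by size, as if all characters were equally likely."""
    chars = sorted(alphabet)
    while len(chars) > 1:
        mid = len(chars) // 2
        left = chars[:mid]
        chars = left if oracle(set(left)) else chars[mid:]
    return chars[0]


def weighted_search(oracle, probs):
    """Halve the candidate list by probability mass instead of by size."""
    chars = sorted(probs, key=probs.get, reverse=True)
    while len(chars) > 1:
        total = sum(probs[c] for c in chars)
        acc, cut = 0.0, 0
        # Never include the last candidate, so the right half is non-empty.
        for i, c in enumerate(chars[:-1]):
            acc += probs[c]
            cut = i + 1
            if acc >= total / 2:
                break
        left = chars[:cut]
        chars = left if oracle(set(left)) else chars[cut:]
    return chars[0]


def unigram_model(corpus):
    """Character frequencies with add-one smoothing (an illustrative model only)."""
    counts = Counter("".join(corpus))
    return {c: counts[c] + 1 for c in ALPHABET}


def infer_string(secret, search):
    """Recover `secret` character by character; return (string, total requests)."""
    recovered, requests = [], 0
    for ch in secret:
        oracle, counter = make_oracle(ch)
        recovered.append(search(oracle))
        requests += counter["requests"]
    return "".join(recovered), requests


if __name__ == "__main__":
    # Made-up training words standing in for a domain-specific corpus.
    model = unigram_model(["user_id", "user_name", "created_at", "email"])
    secret = "username"
    print(infer_string(secret, uniform_binary_search))
    print(infer_string(secret, lambda o: weighted_search(o, model)))
```

In a real setting the set-membership question would have to be encoded as an injected predicate, and the distribution would be updated after every inferred character; both are left out here to keep the sketch self-contained.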