Hakuin: Optimizing Blind SQL Injection with Probabilistic Language Models

2023 IEEE Security and Privacy Workshops (SPW). Pub Date: 2023-05-01. DOI: 10.1109/SPW59333.2023.00039

Jakub Pruzinec, Quynh Anh Nguyen


Citations: 0

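Abstract

SQL Injection (SQLI) is a pervasive web attack in which malicious input is used to dynamically build SQL queries that trick the database (DB) engine into performing unintended, harmful operations. Among many possible exploitations, an attacker may opt to exfiltrate the application data. Exfiltration is straightforward when the web application responds to injected queries with their results. When the content is not exposed, the adversary can still deduce it using Blind SQLI (BSQLI), an inference technique based on response differences or time delays. A common drawback of BSQLI is its low inference rate (one bit per request), which severely limits the volume of data that can be extracted this way. To address this limitation, state-of-the-art BSQLI tools optimize the inference of textual data with binary search. However, this approach has two major limitations: it assumes a uniform distribution of characters, and it does not take into account the history of previously inferred characters. Consequently, the technique is inefficient for natural language, which is used ubiquitously in DBs. This paper presents Hakuin, a new framework for optimizing BSQLI with probabilistic language models. Hakuin employs domain-specific pre-trained and adaptive models to predict the next characters based on the inference history, and it prioritizes the characters that are more likely to be correct. It also tracks statistical information to opportunistically guess strings as a whole instead of inferring their characters separately. We benchmark Hakuin against 3 state-of-the-art BSQLI tools using 20 industry-standard DB schemas and a generic DB. The results show that, compared to the second-best performing tool, Hakuin is about 6 times more efficient in inferring schemas, up to 3.2 times more efficient with generic data, and up to 26 times more efficient on columns with limited values. To the best of our knowledge, Hakuin is the first solution that combines domain-specific pre-trained and adaptive language models to optimize BSQLI. We release its full source code, datasets, and language models to facilitate further research.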
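The contrast the abstract draws between binary search and history-aware, probability-guided inference can be illustrated with a small offline simulation. The sketch below is not Hakuin's implementation and makes no claim about its internals. The `Oracle` class (a stand-in for a vulnerable endpoint that answers one yes/no question per request), the `BigramModel` class, the mass-halving split in `infer_weighted`, the printable-ASCII alphabet, the smoothing constant, and the assumption that the string length is already known are all hypothetical choices made for illustration.

```python
from collections import Counter

# Assumption: secrets contain only printable ASCII (95 symbols).
ALPHABET = [chr(c) for c in range(32, 127)]


class Oracle:
    """Hypothetical stand-in for the injectable endpoint: one boolean per request."""

    def __init__(self, secret):
        self.secret = secret
        self.requests = 0

    def char_in(self, index, candidates):
        self.requests += 1
        return self.secret[index] in candidates


class BigramModel:
    """Counts which character follows which; usable pre-trained and adaptively."""

    def __init__(self):
        self.counts = {}

    def update(self, prev, ch):
        self.counts.setdefault(prev, Counter())[ch] += 1

    def train(self, text):
        prev = ""
        for ch in text:
            self.update(prev, ch)
            prev = ch

    def weights(self, prev):
        seen = self.counts.get(prev, Counter())
        # Small constant so characters never seen after `prev` stay reachable.
        return {ch: seen[ch] + 0.01 for ch in ALPHABET}


def infer_uniform(oracle, index):
    """Baseline binary search: split the candidates in half by count."""
    cands = list(ALPHABET)
    while len(cands) > 1:
        half = cands[: len(cands) // 2]
        cands = half if oracle.char_in(index, set(half)) else cands[len(half):]
    return cands[0]


def infer_weighted(oracle, index, weights):
    """Split the candidates by probability mass, most likely characters first."""
    cands = sorted(ALPHABET, key=lambda ch: -weights[ch])
    while len(cands) > 1:
        total = sum(weights[ch] for ch in cands)
        acc, cut = weights[cands[0]], 1
        # Grow the first group while it stays within half of the remaining mass.
        while cut < len(cands) - 1 and acc + weights[cands[cut]] <= total / 2:
            acc += weights[cands[cut]]
            cut += 1
        first = cands[:cut]
        cands = first if oracle.char_in(index, set(first)) else cands[cut:]
    return cands[0]


def extract(secret, model=None):
    """Infer `secret` character by character; return (text, request count)."""
    oracle = Oracle(secret)
    out, prev = "", ""
    for i in range(len(secret)):  # assumption: the length is already known
        if model is None:
            ch = infer_uniform(oracle, i)
        else:
            ch = infer_weighted(oracle, i, model.weights(prev))
            model.update(prev, ch)  # adapt to what has been inferred so far
        out += ch
        prev = ch
    return out, oracle.requests


if __name__ == "__main__":
    secret = "user_id,user_name,user_email"
    model = BigramModel()
    model.train("user_id,order_id,user_name,product_name")  # toy "pre-training"
    print("binary search:", extract(secret)[1], "requests")
    print("bigram-guided:", extract(secret, model)[1], "requests")
```

On this toy input the bigram-guided search recovers the same string with fewer requests, because characters that the model has already seen after the previous character are usually confirmed in one or two questions, while unseen ones fall back to a near-uniform split. The numbers from such a toy run say nothing about the benchmark figures reported in the abstract.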



