A Generic Solver Combining Unsupervised Learning and Representation Learning for Breaking Text-Based Captchas

Proceedings of The Web Conference 2020. Pub Date: 2020-04-20. DOI: 10.1145/3366423.3380166

Sheng Tian, T. Xiong

{"title":"A Generic Solver Combining Unsupervised Learning and Representation Learning for Breaking Text-Based Captchas","authors":"Sheng Tian, T. Xiong","doi":"10.1145/3366423.3380166","DOIUrl":null,"url":null,"abstract":"Although there are many alternative captcha schemes available, text-based captchas are still one of the most popular security mechanism to maintain Internet security and prevent malicious attacks, due to the user preferences and ease of design. Over the past decade, different methods of breaking captchas have been proposed, which helps captcha keep evolving and become more robust. However, these previous works generally require heavy expert involvement and gradually become ineffective with the introduction of new security features. This paper proposes a generic solver combining unsupervised learning and representation learning to automatically remove the noisy background of captchas and solve text-based captchas. We introduce a new training scheme for constructing mini-batches, which contain a large number of unlabeled hard examples, to improve the efficiency of representation learning. Unlike existing deep learning algorithms, our method requires significantly fewer labeled samples and surpasses the recognition performance of a fully-supervised model with the same network architecture. Moreover, extensive experiments show that the proposed method outperforms state-of-the-art by delivering a higher accuracy on various captcha schemes. We provide further discussions of potential applications of the proposed unified framework. 
We hope that our work can inspire the community to enhance the security of text-based captchas.","PeriodicalId":20754,"journal":{"name":"Proceedings of The Web Conference 2020","volume":"341 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2020-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"12","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of The Web Conference 2020","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3366423.3380166","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Citations: 12

Abstract

Although many alternative captcha schemes are available, text-based captchas remain one of the most popular security mechanisms for maintaining Internet security and preventing malicious attacks, owing to user preference and ease of design. Over the past decade, various methods of breaking captchas have been proposed, which has helped captchas keep evolving and become more robust. However, these earlier approaches generally require heavy expert involvement and gradually become ineffective as new security features are introduced. This paper proposes a generic solver that combines unsupervised learning and representation learning to automatically remove the noisy background of captchas and solve text-based captchas. We introduce a new training scheme for constructing mini-batches that contain a large number of unlabeled hard examples, which improves the efficiency of representation learning. Unlike existing deep learning algorithms, our method requires significantly fewer labeled samples and surpasses the recognition performance of a fully supervised model with the same network architecture. Moreover, extensive experiments show that the proposed method outperforms state-of-the-art methods, delivering higher accuracy on various captcha schemes. We further discuss potential applications of the proposed unified framework. We hope that our work can inspire the community to enhance the security of text-based captchas.
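The abstract does not say how "hard" unlabeled examples are identified or how the mini-batches are assembled. As a rough illustration only, the sketch below assumes hardness is measured by the entropy of a model's predicted class probabilities and that each mini-batch mixes a few labeled samples with the highest-entropy unlabeled ones. The function names, the `unlabeled_fraction` parameter, and the entropy criterion are all hypothetical and are not taken from the paper.

```python
import math
import random
from typing import List, Sequence, Tuple


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def build_mini_batch(
    labeled_ids: Sequence[int],
    unlabeled_probs: Sequence[Sequence[float]],
    batch_size: int,
    unlabeled_fraction: float = 0.75,
    seed: int = 0,
) -> Tuple[List[int], List[int]]:
    """Return (labeled ids, unlabeled indices) for one mini-batch.

    Assumption: an unlabeled sample is "hard" when the current model's
    prediction for it has high entropy. The paper may use another criterion.
    """
    rng = random.Random(seed)
    n_unlabeled = min(int(batch_size * unlabeled_fraction), len(unlabeled_probs))
    n_labeled = min(batch_size - n_unlabeled, len(labeled_ids))

    # Rank unlabeled samples from most to least uncertain.
    ranked = sorted(
        range(len(unlabeled_probs)),
        key=lambda i: entropy(unlabeled_probs[i]),
        reverse=True,
    )
    hard = ranked[:n_unlabeled]

    # Fill the rest of the batch with randomly drawn labeled samples.
    chosen_labeled = rng.sample(list(labeled_ids), n_labeled)
    return chosen_labeled, hard


# Toy predictions for four unlabeled captcha characters over three classes.
probs = [
    [0.98, 0.01, 0.01],
    [0.40, 0.30, 0.30],
    [0.50, 0.50, 0.00],
    [0.90, 0.05, 0.05],
]
labeled, hard = build_mini_batch([10, 11, 12], probs, batch_size=4, unlabeled_fraction=0.5)
print(labeled, hard)
```

In this toy run the two most uncertain predictions (indices 1 and 2) fill the unlabeled half of the batch, and the labeled half is drawn at random from the three labeled ids.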
