Finding Clues for Your Secrets: Semantics-Driven, Learning-Based Privacy Discovery in Mobile Apps

Proceedings 2019 Network and Distributed System Security Symposium; Pub Date: 2018-01-01; DOI: 10.14722/ndss.2018.23099

Yuhong Nan, Zhemin Yang, Xiaofeng Wang, Yuan Zhang, Donglai Zhu, Min Yang


Citations: 54

Abstract

A long-standing challenge in analyzing information leaks within mobile apps is to automatically identify the code operating on sensitive data. Because all existing solutions rely on system APIs (e.g., for IMEI or GPS location) or on features of user interfaces (UI), content from app servers, such as a user’s Facebook profile or payment history, falls through the cracks. Finding such content is important given that most apps today are web applications, whose critical data are often on the server side. At the same time, operations on the data within mobile apps are often hard to capture, since all server-side information is delivered to the app in the same way, sensitive or not. A unique observation of our research is that in modern apps, a program is essentially a semantics-rich document carrying meaningful program elements such as method names, variables and constants that reveal the sensitive data involved, even when the program is under moderate obfuscation. Leveraging this observation, we develop a novel semantics-driven solution for automatic discovery of sensitive user data, including those from the server side. Our approach utilizes natural language processing (NLP) to automatically locate the program elements (variables, methods, etc.) of interest, and then performs a learning-based program structure analysis to accurately identify those that indeed carry sensitive content. Using this new technique, we analyzed 445,668 popular apps, an unprecedented scale for this type of research. Our work brings to light the pervasiveness of information leaks and the channels through which the leaks happen, including unintentional over-sharing across libraries and aggressive data acquisition behaviors. Further, we found that many high-profile apps and libraries are involved in such leaks. Our findings contribute to a better understanding of the privacy risk in mobile apps and also highlight the importance of data protection in today’s software composition.
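The first stage described in the abstract, locating program elements whose names hint at sensitive data, can be illustrated with a minimal sketch. This is not the authors' implementation: the keyword set, the function names `split_identifier` and `flag_candidates`, and the plain keyword match (standing in for the paper's NLP step) are all assumptions made for illustration. The learning-based structure analysis that filters these candidates is not shown.

```python
import re

# Hypothetical keyword list; the abstract does not give the actual vocabulary.
SENSITIVE_KEYWORDS = {
    "password", "email", "phone", "payment", "profile", "location", "birthday",
}


def split_identifier(name):
    """Split a camelCase or snake_case identifier into lowercase word tokens."""
    tokens = []
    # First break on underscores, digits, and other non-word characters.
    for part in re.split(r"[_\W\d]+", name):
        # Then break on case boundaries, keeping acronyms such as "HTTP" whole.
        tokens.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", part))
    return [t.lower() for t in tokens]


def flag_candidates(identifiers):
    """Map each identifier to the sensitive keywords found among its tokens.

    Identifiers with no match (including obfuscated names like "a") are omitted.
    """
    flagged = {}
    for name in identifiers:
        hits = sorted(set(split_identifier(name)) & SENSITIVE_KEYWORDS)
        if hits:
            flagged[name] = hits
    return flagged


if __name__ == "__main__":
    names = ["getUserEmail", "payment_history", "a", "renderButton"]
    print(flag_candidates(names))
```

A name match alone only produces candidates; an identifier such as `emailButtonColor` would also be flagged, which is why the abstract pairs this step with a second analysis that decides which candidates actually carry sensitive content.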
