Anonymizing NYC Taxi Data: Does It Matter?

2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA) Pub Date : 2016-10-01 DOI:10.1109/DSAA.2016.21

Marie Douriez, Harish Doraiswamy, J. Freire, Cláudio T. Silva

{"title":"Anonymizing NYC Taxi Data: Does It Matter?","authors":"Marie Douriez, Harish Doraiswamy, J. Freire, Cláudio T. Silva","doi":"10.1109/DSAA.2016.21","DOIUrl":null,"url":null,"abstract":"The widespread use of location-based services has led to an increasing availability of trajectory data from urban environments. These data carry rich information that are useful for improving cities through traffic management and city planning. Yet, it also contains information about individuals which can jeopardize their privacy. In this study, we work with the New York City (NYC) taxi trips data set publicly released by the Taxi and Limousine Commission (TLC). This data set contains information about every taxi cab ride that happened in NYC. A bad hashing of the medallion numbers (the ID corresponding to a taxi) allowed the recovery of all the medallion numbers and led to a privacy breach for the drivers, whose income could be easily extracted. In this work, we initiate a study to evaluate whether \"perfect\" anonymity is possible and if such an identity disclosure can be avoided given the availability of diverse sets of external data sets through which the hidden information can be recovered. This is accomplished through a spatio-temporal join based attack which matches the taxi data with an external medallion data that can be easily gathered by an adversary. Using a simulation of the medallion data, we show that our attack can re-identify over 91% of the taxis that ply in NYC even when using a perfect pseudonymization of medallion numbers. We also explore the effectiveness of trajectory anonymization strategies and demonstrate that our attack can still identify a significant fraction of the taxis in NYC. Given the restrictions in publishing the taxi data by TLC, our results indicate that unless the utility of the data set is significantly compromised, it will not be possible to maintain the privacy of taxi medallion owners and drivers.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"68 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"57","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DSAA.2016.21","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 57

Abstract

The widespread use of location-based services has led to an increasing availability of trajectory data from urban environments. These data carry rich information that are useful for improving cities through traffic management and city planning. Yet, it also contains information about individuals which can jeopardize their privacy. In this study, we work with the New York City (NYC) taxi trips data set publicly released by the Taxi and Limousine Commission (TLC). This data set contains information about every taxi cab ride that happened in NYC. A bad hashing of the medallion numbers (the ID corresponding to a taxi) allowed the recovery of all the medallion numbers and led to a privacy breach for the drivers, whose income could be easily extracted. In this work, we initiate a study to evaluate whether "perfect" anonymity is possible and if such an identity disclosure can be avoided given the availability of diverse sets of external data sets through which the hidden information can be recovered. This is accomplished through a spatio-temporal join based attack which matches the taxi data with an external medallion data that can be easily gathered by an adversary. Using a simulation of the medallion data, we show that our attack can re-identify over 91% of the taxis that ply in NYC even when using a perfect pseudonymization of medallion numbers. We also explore the effectiveness of trajectory anonymization strategies and demonstrate that our attack can still identify a significant fraction of the taxis in NYC. Given the restrictions in publishing the taxi data by TLC, our results indicate that unless the utility of the data set is significantly compromised, it will not be possible to maintain the privacy of taxi medallion owners and drivers.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

匿名纽约市出租车数据:重要吗?

基于位置的服务的广泛使用使得来自城市环境的轨迹数据的可用性越来越高。这些数据包含丰富的信息，有助于通过交通管理和城市规划来改善城市。然而，它也包含可能危及其隐私的个人信息。在这项研究中，我们使用了纽约市出租车和豪华轿车委员会(TLC)公开发布的出租车出行数据集。这个数据集包含了在纽约发生的每一次出租车乘坐的信息。对车牌号码(出租车对应的ID)进行错误的散列处理，可以恢复所有车牌号码，并导致司机的隐私泄露，他们的收入很容易被提取出来。在这项工作中，我们发起了一项研究，以评估“完美”匿名是否可能，以及考虑到各种外部数据集的可用性，是否可以避免这种身份披露，通过这些数据集可以恢复隐藏的信息。这是通过基于时空连接的攻击来实现的，该攻击将出租车数据与可以被对手轻松收集的外部奖章数据相匹配。通过对牌照数据的模拟，我们发现，即使使用完美的牌照数字假名，我们的攻击也可以重新识别在纽约市运营的91%以上的出租车。我们还探索了轨迹匿名化策略的有效性，并证明我们的攻击仍然可以识别纽约市很大一部分出租车。鉴于TLC发布出租车数据的限制，我们的研究结果表明，除非数据集的效用受到严重损害，否则不可能维护出租车牌照所有者和司机的隐私。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)

自引率

0.00%

发文量

期刊最新文献

A Multi-Granularity Pattern-Based Sequence Classification Framework for Educational Data Task Composition in Crowdsourcing Maritime Pattern Extraction from AIS Data Using a Genetic Algorithm What Did I Do Wrong in My MOBA Game? Mining Patterns Discriminating Deviant Behaviours Nonparametric Adjoint-Based Inference for Stochastic Differential Equations