Anonymizing NYC Taxi Data: Does It Matter?

Marie Douriez, Harish Doraiswamy, J. Freire, Cláudio T. Silva
{"title":"Anonymizing NYC Taxi Data: Does It Matter?","authors":"Marie Douriez, Harish Doraiswamy, J. Freire, Cláudio T. Silva","doi":"10.1109/DSAA.2016.21","DOIUrl":null,"url":null,"abstract":"The widespread use of location-based services has led to an increasing availability of trajectory data from urban environments. These data carry rich information that are useful for improving cities through traffic management and city planning. Yet, it also contains information about individuals which can jeopardize their privacy. In this study, we work with the New York City (NYC) taxi trips data set publicly released by the Taxi and Limousine Commission (TLC). This data set contains information about every taxi cab ride that happened in NYC. A bad hashing of the medallion numbers (the ID corresponding to a taxi) allowed the recovery of all the medallion numbers and led to a privacy breach for the drivers, whose income could be easily extracted. In this work, we initiate a study to evaluate whether \"perfect\" anonymity is possible and if such an identity disclosure can be avoided given the availability of diverse sets of external data sets through which the hidden information can be recovered. This is accomplished through a spatio-temporal join based attack which matches the taxi data with an external medallion data that can be easily gathered by an adversary. Using a simulation of the medallion data, we show that our attack can re-identify over 91% of the taxis that ply in NYC even when using a perfect pseudonymization of medallion numbers. We also explore the effectiveness of trajectory anonymization strategies and demonstrate that our attack can still identify a significant fraction of the taxis in NYC. Given the restrictions in publishing the taxi data by TLC, our results indicate that unless the utility of the data set is significantly compromised, it will not be possible to maintain the privacy of taxi medallion owners and drivers.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"68 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"57","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DSAA.2016.21","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 57

Abstract

The widespread use of location-based services has led to an increasing availability of trajectory data from urban environments. These data carry rich information that are useful for improving cities through traffic management and city planning. Yet, it also contains information about individuals which can jeopardize their privacy. In this study, we work with the New York City (NYC) taxi trips data set publicly released by the Taxi and Limousine Commission (TLC). This data set contains information about every taxi cab ride that happened in NYC. A bad hashing of the medallion numbers (the ID corresponding to a taxi) allowed the recovery of all the medallion numbers and led to a privacy breach for the drivers, whose income could be easily extracted. In this work, we initiate a study to evaluate whether "perfect" anonymity is possible and if such an identity disclosure can be avoided given the availability of diverse sets of external data sets through which the hidden information can be recovered. This is accomplished through a spatio-temporal join based attack which matches the taxi data with an external medallion data that can be easily gathered by an adversary. Using a simulation of the medallion data, we show that our attack can re-identify over 91% of the taxis that ply in NYC even when using a perfect pseudonymization of medallion numbers. We also explore the effectiveness of trajectory anonymization strategies and demonstrate that our attack can still identify a significant fraction of the taxis in NYC. Given the restrictions in publishing the taxi data by TLC, our results indicate that unless the utility of the data set is significantly compromised, it will not be possible to maintain the privacy of taxi medallion owners and drivers.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
匿名纽约市出租车数据:重要吗?
基于位置的服务的广泛使用使得来自城市环境的轨迹数据的可用性越来越高。这些数据包含丰富的信息,有助于通过交通管理和城市规划来改善城市。然而,它也包含可能危及其隐私的个人信息。在这项研究中,我们使用了纽约市出租车和豪华轿车委员会(TLC)公开发布的出租车出行数据集。这个数据集包含了在纽约发生的每一次出租车乘坐的信息。对车牌号码(出租车对应的ID)进行错误的散列处理,可以恢复所有车牌号码,并导致司机的隐私泄露,他们的收入很容易被提取出来。在这项工作中,我们发起了一项研究,以评估“完美”匿名是否可能,以及考虑到各种外部数据集的可用性,是否可以避免这种身份披露,通过这些数据集可以恢复隐藏的信息。这是通过基于时空连接的攻击来实现的,该攻击将出租车数据与可以被对手轻松收集的外部奖章数据相匹配。通过对牌照数据的模拟,我们发现,即使使用完美的牌照数字假名,我们的攻击也可以重新识别在纽约市运营的91%以上的出租车。我们还探索了轨迹匿名化策略的有效性,并证明我们的攻击仍然可以识别纽约市很大一部分出租车。鉴于TLC发布出租车数据的限制,我们的研究结果表明,除非数据集的效用受到严重损害,否则不可能维护出租车牌照所有者和司机的隐私。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
A Multi-Granularity Pattern-Based Sequence Classification Framework for Educational Data Task Composition in Crowdsourcing Maritime Pattern Extraction from AIS Data Using a Genetic Algorithm What Did I Do Wrong in My MOBA Game? Mining Patterns Discriminating Deviant Behaviours Nonparametric Adjoint-Based Inference for Stochastic Differential Equations
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1