
Proceedings of the 25th International Conference on World Wide Web: Latest Publications

People and Cookies: Imperfect Treatment Assignment in Online Experiments
Pub Date : 2016-04-11 DOI: 10.1145/2872427.2882984
Dominic Coey, Michael C. Bailey
Identifying the same internet user across devices or over time is often infeasible. This presents a problem for online experiments, as it precludes person-level randomization. Randomization must instead be done using imperfect proxies for people, like cookies, email addresses, or device identifiers. Users may be partially treated and partially untreated as some of their cookies are assigned to the test group and some to the control group, complicating statistical inference. We show that the estimated treatment effect in a cookie-level experiment converges to a weighted average of the marginal effects of treating more of a user's cookies. If the marginal effects of cookie treatment exposure are positive and constant, it underestimates the true person-level effect by a factor equal to the number of cookies per person. Using two separate datasets---cookie assignment data from Atlas and advertising exposure and purchase data from Facebook---we empirically quantify the differences between cookie and person-level advertising effectiveness experiments. The effects are substantial: cookie tests underestimate the true person-level effects by a factor of about three, and require two to three times the number of people to achieve the same power as a test with perfect treatment assignment.
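The dilution result can be checked with a short simulation (an illustrative sketch, not the paper's estimator): under a linear outcome with constant marginal effects and k cookies per person, the cookie-level difference in means converges to the person-level effect divided by k.

```python
import numpy as np

rng = np.random.default_rng(0)
n_people, k, tau = 200_000, 3, 1.0  # k cookies per person; tau = true person-level effect

# Cookie-level randomization: each cookie independently assigned to test/control.
assign = rng.integers(0, 2, size=(n_people, k))       # 1 = treated cookie
f = assign.mean(axis=1)                               # fraction of a person's cookies treated
y = 5.0 + tau * f + rng.normal(0, 1, n_people)        # linear, constant marginal effects

# Each cookie records its person's outcome; diff-in-means over cookies.
y_cookie = np.repeat(y, k)                            # row-major, matches assign.ravel()
treated = assign.ravel().astype(bool)
est = y_cookie[treated].mean() - y_cookie[~treated].mean()
print(f"cookie-level estimate: {est:.3f}  (true person-level effect: {tau})")
# The estimate converges to tau / k, i.e. an underestimate by the number of cookies per person.
```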
Citations: 25
Understanding User Economic Behavior in the City Using Large-scale Geotagged and Crowdsourced Data
Pub Date : 2016-04-11 DOI: 10.1145/2872427.2883066
Yingjie Zhang, Beibei Li, Jason I. Hong
The pervasiveness of mobile technologies today has facilitated the creation of massive crowdsourced and geotagged data from individual users in real time and at different locations in the city. Such ubiquitous user-generated data allow us to infer various patterns of human behavior, which help us understand the interactions between humans and cities. In this study, we focus on understanding users' economic behavior in the city by examining the economic value of crowdsourced and geotagged data. Specifically, we extract multiple traffic and human mobility features from publicly available data sources using NLP and geo-mapping techniques, and examine the effects of both static and dynamic features on the economic outcomes of local businesses. Our study is instantiated on a unique dataset of restaurant bookings from OpenTable for 3,187 restaurants in New York City from November 2013 to March 2014. Our results suggest that foot traffic can increase local popularity and business performance, while mobility and traffic from automobiles may hurt local businesses, especially well-established chains and high-end restaurants. We also find that, on average, one more street closure nearby leads to a 4.7% decrease in the probability of a restaurant being fully booked during the dinner peak. Our study demonstrates how large volumes and diverse sources of crowdsourced and geotagged user-generated data can be used to create metrics that predict local economic demand in a manner that is fast, cheap, accurate, and meaningful.
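As a toy illustration of how the street-closure marginal effect reads, the sketch below embeds the reported ~4.7-percentage-point-per-closure estimate into synthetic bookings. The data, baseline rate, and closure counts are invented; only the effect size comes from the abstract.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
closures = rng.integers(0, 3, n)              # nearby street closures that evening (synthetic)
base = 0.30                                   # assumed baseline P(fully booked at dinner peak)
p = np.clip(base - 0.047 * closures, 0, 1)    # ~4.7% drop per additional closure
fully_booked = rng.random(n) < p

for c in range(3):
    rate = fully_booked[closures == c].mean()
    print(f"{c} closures: fully-booked rate {rate:.3f}")
```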
Citations: 15
Reverse Engineering SPARQL Queries
Pub Date : 2016-04-11 DOI: 10.1145/2872427.2882989
M. Arenas, G. I. Diaz, Egor V. Kostylev
Semantic Web systems provide open interfaces for end-users to access data via a powerful high-level query language, SPARQL. But users unfamiliar with either the details of SPARQL or properties of the target dataset may find it easier to query by example -- give examples of the information they want (or examples of both what they want and what they do not want) and let the system reverse engineer the desired query from the examples. This approach has been heavily used in the setting of relational databases. We provide here an investigation of the reverse engineering problem in the context of SPARQL. We first provide a theoretical study, formalising variants of the reverse engineering problem and giving tight bounds on its complexity. We next explain an implementation of a reverse engineering tool for positive examples. An experimental analysis of the tool shows that it scales well in the data size, number of examples, and in the size of the smallest query that fits the data. We also give evidence that reverse engineering tools can provide benefits on real-life datasets.
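In miniature, query-by-example reverse engineering looks like the sketch below: enumerate candidate queries (here, only single-triple-pattern queries over a toy graph) and keep those whose answers coincide with the user's positive examples. This is a deliberately tiny stand-in for the idea, not the paper's algorithm.

```python
# Toy RDF-like graph as (subject, predicate, object) triples.
graph = {
    ("alice", "worksAt", "acme"), ("bob", "worksAt", "acme"),
    ("carol", "worksAt", "globex"), ("alice", "knows", "carol"),
}
positives = {"alice", "bob"}       # examples of the answers the user wants

def candidate_queries(graph):
    """Enumerate single-triple-pattern queries of the form ?x <p> <o>."""
    return {(p, o) for (_, p, o) in graph}

def evaluate(graph, p, o):
    """Answers of the pattern ?x <p> <o> over the graph."""
    return {s for (s, p2, o2) in graph if p2 == p and o2 == o}

# Reverse engineering: keep the patterns whose answers are exactly the examples.
fits = [(p, o) for (p, o) in candidate_queries(graph)
        if evaluate(graph, p, o) == positives]
print(fits)   # [('worksAt', 'acme')]  ~  SELECT ?x WHERE { ?x :worksAt :acme }
```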
Citations: 60
MapWatch: Detecting and Monitoring International Border Personalization on Online Maps
Pub Date : 2016-04-11 DOI: 10.1145/2872427.2883016
Gary Soeller, Karrie Karahalios, Christian Sandvig, Christo Wilson
Maps have long played a crucial role in enabling people to conceptualize and navigate the world around them. However, maps also encode the world-views of their creators. Disputed international borders are one example of this: governments may mandate that cartographers produce maps that conform to their view of a territorial dispute. Today, online maps maintained by private corporations have become the norm. However, these new maps are still subject to old debates. Companies like Google and Bing resolve these disputes by localizing their maps to meet government requirements and user preferences, i.e., users in different locations are shown maps with different international boundaries. We argue that this non-transparent personalization of maps may exacerbate nationalistic disputes by promoting divergent views of geopolitical realities. To address this problem, we present MapWatch, our system for detecting and cataloging personalization of international borders in online maps. Our system continuously crawls all map tiles from Google and Bing maps, and leverages crowdworkers to identify border personalization. In this paper, we present the architecture of MapWatch, and analyze the instances of border personalization on Google and Bing, including one border change that MapWatch identified live, as Google was rolling out the update.
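The detection step can be sketched as follows. This is an assumption-laden toy, not MapWatch's implementation: tiles are taken as already downloaded per vantage point, and byte-level hashing stands in for image comparison.

```python
import hashlib

def personalized_tiles(crawl):
    """crawl: {(zoom, x, y): {locale: tile_bytes}}.
    Return tile coordinates whose image bytes differ across locales --
    candidates for crowdworker review."""
    flagged = []
    for coord, by_locale in crawl.items():
        digests = {hashlib.sha256(img).hexdigest() for img in by_locale.values()}
        if len(digests) > 1:
            flagged.append(coord)
    return flagged

# Hypothetical crawl snapshot: one tile rendered differently for US vs. Indian users.
crawl = {
    (6, 44, 26): {"us": b"\x89PNG...A", "in": b"\x89PNG...B"},   # disputed border
    (6, 44, 27): {"us": b"\x89PNG...C", "in": b"\x89PNG...C"},   # identical everywhere
}
print(personalized_tiles(crawl))   # [(6, 44, 26)]
```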
Citations: 23
Strengthening Weak Identities Through Inter-Domain Trust Transfer
Pub Date : 2016-04-11 DOI: 10.1145/2872427.2883015
Giridhari Venkatadri, Oana Goga, Changtao Zhong, Bimal Viswanath, K. Gummadi, Nishanth R. Sastry
On most current websites, untrustworthy or spammy identities are easily created. Existing proposals to detect untrustworthy identities rely on reputation signals obtained by observing the activities of identities over time within a single site or domain; thus, there is a time lag during which websites cannot easily distinguish attackers from legitimate users. In this paper, we investigate the feasibility of leveraging information about identities that is aggregated across multiple domains to reason about their trustworthiness. Our key insight is that while honest users naturally maintain identities across multiple domains (where they have proven their trustworthiness and acquired reputation over time), attackers are discouraged by the additional effort and cost of doing the same. We propose a flexible framework to transfer trust between domains that can be implemented in today's systems without significant loss of privacy or significant implementation overhead. We demonstrate the potential for inter-domain trust assessment using extensive data collected from Pinterest, Facebook, and Twitter. Our results show that newer domains such as Pinterest can benefit by transferring trust from more established domains such as Facebook and Twitter, declaring more users as likely to be trustworthy much earlier on (approximately one year earlier).
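A minimal sketch of the trust-transfer idea (the field names and thresholds below are hypothetical, not the paper's framework):

```python
def likely_trustworthy(identity, min_age_days=365, min_followers=50):
    """Accept a fresh identity on a new domain early if a linked account on an
    established domain (e.g. Facebook, Twitter) has already built up reputation.
    Field names and thresholds are illustrative only."""
    return any(a["age_days"] >= min_age_days and a["followers"] >= min_followers
               for a in identity.get("linked_accounts", []))

new_user = {"linked_accounts": [{"age_days": 900, "followers": 120}]}
sock_puppet = {"linked_accounts": []}
print(likely_trustworthy(new_user), likely_trustworthy(sock_puppet))  # True False
```

An attacker would have to invest a year of activity on the established domain per fake identity, which is exactly the cost asymmetry the abstract describes.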
Citations: 18
Tell Me About Yourself: The Malicious CAPTCHA Attack
Pub Date : 2016-04-11 DOI: 10.1145/2872427.2883005
Nethanel Gelernter, A. Herzberg
We present the malicious CAPTCHA attack, which allows a rogue website to trick users into unknowingly disclosing their private information. The rogue site displays the private information to the user in an obfuscated manner, as if it were a CAPTCHA challenge; the user is unaware that solving the CAPTCHA results in disclosing private information. This circumvents the Same Origin Policy (SOP), whose goal is to prevent rogue sites from accessing private information, by exploiting the fact that many websites allow private information to be displayed (to the user) upon requests from any (even rogue) website. Information disclosed this way includes names, phone numbers, email and physical addresses, search history, preferences, partial credit card numbers, and more. The vulnerability is common and the attack works against many popular sites, including nine of the ten most popular websites. We evaluated the attack using IRB-approved, ethical user experiments.
Citations: 8
Competition on Price and Quality in Cloud Computing
Pub Date : 2016-04-11 DOI: 10.1145/2872427.2883043
Cinar Kilcioglu, Justin M. Rao
The public cloud "infrastructure as a service" market possesses unique features that make it difficult to predict long-run economic behavior. On the one hand, major providers buy their hardware from the same manufacturers, operate in similar locations and offer a similar menu of products. On the other hand, the competitors use different proprietary "fabric" to manage virtualization, resource allocation and data transfer. The menus offered by each provider involve a discrete number of choices (virtual machine sizes) and allow providers to locate in different parts of the price-quality space. We document this differentiation empirically by running benchmarking tests. This allows us to calibrate a model of firm technology. Firm technology is an input into our theoretical model of price-quality competition. The monopoly case highlights the importance of competition in blocking "bad equilibrium" where performance is intentionally slowed down or options are unduly limited. In duopoly, price competition is fierce, but prices do not converge to the same level because of price-quality differentiation. The model helps explain market trends, such as the healthy operating profit margin recently reported by Amazon Web Services. Our empirically calibrated model helps explain not only price-cutting behavior but also how providers can maintain a profit despite predictions that the market "should be" totally commoditized.
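A textbook vertical-differentiation toy (explicitly not the paper's calibrated model, and with invented numbers) illustrates why price competition need not drive both providers to the same price: with heterogeneous tastes for quality, a low-quality/low-price and a high-quality/high-price provider can both retain customers and earn positive margins.

```python
import numpy as np

# Consumers with taste theta ~ U[0,1] buy from the provider maximizing theta*q - p,
# or buy nothing if all utilities are negative. Qualities and prices are illustrative.
theta = np.linspace(0, 1, 100_001)
q = {"low": 1.0, "high": 2.0}
p = {"low": 0.25, "high": 0.90}

u_low = theta * q["low"] - p["low"]
u_high = theta * q["high"] - p["high"]
choice_high = (u_high > u_low) & (u_high > 0)   # high-taste consumers
choice_low = (u_low >= u_high) & (u_low > 0)    # mid-taste consumers

profit = {"low": p["low"] * choice_low.mean(), "high": p["high"] * choice_high.mean()}
print(profit)   # both providers earn positive profit at different price points
```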
Citations: 18
An Empirical Study of Web Cookies
Pub Date : 2016-04-11 DOI: 10.1145/2872427.2882991
Aaron Cahn, Scott Alfeld, P. Barford, S. Muthukrishnan
Web cookies are used widely by publishers and 3rd parties to track users and their behaviors. Despite the ubiquitous use of cookies, there is little prior work on their characteristics such as standard attributes, placement policies, and the knowledge that can be amassed via 3rd party cookies. In this paper, we present an empirical study of web cookie characteristics, placement practices and information transmission. To conduct this study, we implemented a lightweight web crawler that tracks and stores the cookies as it navigates to websites. We use this crawler to collect over 3.2M cookies from the two crawls, separated by 18 months, of the top 100K Alexa web sites. We report on the general cookie characteristics and add context via a cookie category index and website genre labels. We consider privacy implications by examining specific cookie attributes and placement behavior of 3rd party cookies. We find that 3rd party cookies outnumber 1st party cookies by a factor of two, and we illuminate the connection between domain genres and cookie attributes. We find that less than 1% of the entities that place cookies can aggregate information across 75% of web sites. Finally, we consider the issue of information transmission and aggregation by domains via 3rd party cookies. We develop a mathematical framework to quantify user information leakage for a broad class of users, and present findings using real world domains. In particular, we demonstrate the interplay between a domain's footprint across the Internet and the browsing behavior of users, which has significant impact on information transmission.
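The 1st- vs. 3rd-party split and the site-coverage statistic can be sketched from crawl observations like so. This is a simplified stand-in: real party classification needs eTLD+1 matching against a public-suffix list, which the naive suffix check below ignores.

```python
from collections import defaultdict

def cookie_stats(observations):
    """observations: (site, cookie_domain) pairs recorded by a crawler.
    Split cookies into 1st/3rd party and count how many distinct sites
    each 3rd-party entity can observe."""
    first = third = 0
    coverage = defaultdict(set)
    for site, cdom in observations:
        if cdom == site or cdom.endswith("." + site):
            first += 1
        else:
            third += 1
            coverage[cdom].add(site)
    return first, third, {d: len(s) for d, s in coverage.items()}

obs = [("news.example", "news.example"),     # 1st party
       ("news.example", "tracker.example"),  # 3rd party
       ("shop.example", "tracker.example")]  # same tracker, second site
print(cookie_stats(obs))   # (1, 2, {'tracker.example': 2})
```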
Citations: 86
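The crawl methodology in the study above — collect the cookies each site sets, then classify every cookie as first- or third-party by comparing its domain against the visited site's domain — can be sketched in a few lines. This is an illustrative toy, not the authors' crawler: the function names are invented, and the registrable-domain heuristic deliberately ignores multi-label public suffixes such as `co.uk` (a real crawler would consult the Public Suffix List).

```python
from urllib.parse import urlparse

def registrable_domain(host: str) -> str:
    # Naive heuristic: keep the last two labels. Real crawlers use the
    # Public Suffix List; "co.uk"-style suffixes break this shortcut.
    parts = host.lower().lstrip(".").split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else host

def classify_cookie(page_url: str, cookie_domain: str) -> str:
    """Label a cookie first-party or third-party relative to the page."""
    site = registrable_domain(urlparse(page_url).hostname or "")
    if registrable_domain(cookie_domain) == site:
        return "first-party"
    return "third-party"

def partyness_counts(page_url: str, cookie_domains: list) -> dict:
    """Tally the first- vs third-party split for one page's cookies."""
    counts = {"first-party": 0, "third-party": 0}
    for domain in cookie_domains:
        counts[classify_cookie(page_url, domain)] += 1
    return counts
```

Running `partyness_counts` over every crawled page and summing the tallies yields the kind of aggregate split the paper reports (third-party cookies outnumbering first-party by roughly two to one).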
An In-depth Study of Mobile Browser Performance 手机浏览器性能的深入研究
Pub Date : 2016-04-11 DOI: 10.1145/2872427.2883014
Javad Nejati, A. Balasubramanian
Mobile page load times are an order of magnitude slower compared to non-mobile pages. It is not clear what causes the poor performance: the slower network, the slower computational speeds, or other reasons. Further, most Web optimizations are designed for non-mobile browsers and do not translate well to the mobile browser. Towards understanding mobile Web page load times, in this paper we: (1) perform an in-depth pairwise comparison of loading a page on a mobile versus a non-mobile browser, and (2) characterize the bottlenecks in the mobile browser vis-à-vis non-mobile browsers. To this end, we build a testbed that allows us to directly compare the low-level page load activities and bottlenecks when loading a page on a mobile versus a non-mobile browser. We find that computation is the main bottleneck when loading a page on mobile browsers. This is in contrast to non-mobile browsers where networking is the main bottleneck. We also find that the composition of the critical path during page load is different when loading pages on the mobile versus the non-mobile browser. A key takeaway of our work is that we need to fundamentally rethink optimizations for mobile browsers.
{"title":"An In-depth Study of Mobile Browser Performance","authors":"Javad Nejati, A. Balasubramanian","doi":"10.1145/2872427.2883014","DOIUrl":"https://doi.org/10.1145/2872427.2883014","abstract":"Mobile page load times are an order of magnitude slower compared to non-mobile pages. It is not clear what causes the poor performance: the slower network, the slower computational speeds, or other reasons. Further, most Web optimizations are designed for non-mobile browsers and do not translate well to the mobile browser. Towards understanding mobile Web page load times, in this paper we: (1) perform an in-depth pairwise comparison of loading a page on a mobile versus a non-mobile browser, and (2) characterize the bottlenecks in the mobile browser vis-a-vis non-mobile browsers. To this end, we build a testbed that allows us to directly compare the low-level page load activities and bottlenecks when loading a page on a mobile versus a non-mobile browser. We find that computation is the main bottleneck when loading a page on mobile browsers. This is in contrast to non-mobile browsers where networking is the main bottleneck. We also find that the composition of the critical path during page load is different when loading pages on the mobile versus the non-mobile browser. A key takeaway of our work is that we need to fundamentally rethink optimizations for mobile browsers.","PeriodicalId":20455,"journal":{"name":"Proceedings of the 25th International Conference on World Wide Web","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90372984","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 89
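The paper's decomposition of a page load into a network phase and a computation phase can be illustrated with a stdlib-only sketch. The actual study instruments low-level activities inside the browser itself; here a URL fetch merely stands in for the network phase and an HTML parse for the computation phase, and the names (`profile_load`, `profile_parse`) are hypothetical, not from the paper's testbed.

```python
import time
from html.parser import HTMLParser
from urllib.request import urlopen

class _CountingParser(HTMLParser):
    """Counts start tags as a stand-in for real DOM-construction work."""
    def __init__(self):
        super().__init__()
        self.elements = 0

    def handle_starttag(self, tag, attrs):
        self.elements += 1

def profile_parse(html: str) -> dict:
    """Computation phase: parse the document and report element count."""
    parser = _CountingParser()
    parser.feed(html)
    parser.close()
    return {"elements": parser.elements}

def profile_load(url: str) -> dict:
    """Split one page load into a network phase (fetch) and a
    computation phase (parse), mirroring the paper's decomposition."""
    t0 = time.perf_counter()
    html = urlopen(url).read().decode("utf-8", errors="replace")
    t1 = time.perf_counter()
    stats = profile_parse(html)
    t2 = time.perf_counter()
    return {"network_s": t1 - t0, "compute_s": t2 - t1, **stats}
```

Comparing `network_s` against `compute_s` for the same page fetched from a fast and a slow machine gives a crude feel for the paper's finding that computation, not networking, dominates on mobile-class hardware.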
Foundations of JSON Schema JSON模式的基础
Pub Date : 2016-04-11 DOI: 10.1145/2872427.2883029
Felipe Pezoa, Juan L. Reutter, F. Suárez, M. Ugarte, D. Vrgoc
JSON -- the most popular data format for sending API requests and responses -- is still lacking a standardized schema or meta-data definition that allows the developers to specify the structure of JSON documents. JSON Schema is an attempt to provide a general purpose schema language for JSON, but it is still work in progress, and the formal specification has not yet been agreed upon. Why this could be a problem becomes evident when examining the behaviour of numerous tools for validating JSON documents against this initial schema proposal: although they agree on most general cases, when presented with the greyer areas of the specification they tend to differ significantly. In this paper we provide the first formal definition of syntax and semantics for JSON Schema and use it to show that implementing this layer on top of JSON is feasible in practice. This is done both by analysing the theoretical aspects of the validation problem and by showing how to set up and validate a JSON Schema for Wikidata, the central storage for Wikimedia.
{"title":"Foundations of JSON Schema","authors":"Felipe Pezoa, Juan L. Reutter, F. Suárez, M. Ugarte, D. Vrgoc","doi":"10.1145/2872427.2883029","DOIUrl":"https://doi.org/10.1145/2872427.2883029","url":null,"abstract":"JSON -- the most popular data format for sending API requests and responses -- is still lacking a standardized schema or meta-data definition that allows the developers to specify the structure of JSON documents. JSON Schema is an attempt to provide a general purpose schema language for JSON, but it is still work in progress, and the formal specification has not yet been agreed upon. Why this could be a problem becomes evident when examining the behaviour of numerous tools for validating JSON documents against this initial schema proposal: although they agree on most general cases, when presented with the greyer areas of the specification they tend to differ significantly. In this paper we provide the first formal definition of syntax and semantics for JSON Schema and use it to show that implementing this layer on top of JSON is feasible in practice. This is done both by analysing the theoretical aspects of the validation problem and by showing how to set up and validate a JSON Schema for Wikidata, the central storage for Wikimedia.","PeriodicalId":20455,"journal":{"name":"Proceedings of the 25th International Conference on World Wide Web","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91521248","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 293
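A feel for what "formal syntax and semantics for JSON Schema" buys can be had from a toy validator covering just the `type`, `required`, `properties`, and `items` keywords. This is a deliberately tiny subset for illustration — not the draft specification's full semantics and not the paper's formalization — but it shows how each keyword can be given a precise, unambiguous meaning.

```python
import json

# JSON Schema "type" names mapped to the Python types json.loads produces.
TYPES = {
    "object": dict, "array": list, "string": str,
    "number": (int, float), "boolean": bool, "null": type(None),
}

def validate(instance, schema) -> bool:
    """Check an instance against a schema using only the keywords
    type, required, properties, and items (a toy subset)."""
    if "type" in schema:
        if not isinstance(instance, TYPES[schema["type"]]):
            return False
        # bool is a subclass of int in Python, but not a JSON number.
        if schema["type"] == "number" and isinstance(instance, bool):
            return False
    if isinstance(instance, dict):
        for key in schema.get("required", []):
            if key not in instance:
                return False
        for key, subschema in schema.get("properties", {}).items():
            if key in instance and not validate(instance[key], subschema):
                return False
    if isinstance(instance, list) and "items" in schema:
        if not all(validate(item, schema["items"]) for item in instance):
            return False
    return True

schema = {"type": "object", "required": ["id"],
          "properties": {"id": {"type": "number"},
                         "tags": {"type": "array",
                                  "items": {"type": "string"}}}}
assert validate(json.loads('{"id": 1, "tags": ["a"]}'), schema)
assert not validate(json.loads('{"tags": ["a"]}'), schema)  # "id" is required
```

The greyer areas the abstract mentions show up even at this scale — e.g. whether `true` satisfies `{"type": "number"}` depends on an explicit decision, made here in the boolean guard; divergent implementations are exactly what a formal semantics pins down.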
Journal
Proceedings of the 25th International Conference on World Wide Web