
Latest publications from the 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)

A Multi-Granularity Pattern-Based Sequence Classification Framework for Educational Data
Mohammad Jaber, P. Wood, P. Papapetrou, A. González‐Marcos
In many application domains, such as education, sequences of events occurring over time need to be studied in order to understand the generative process behind these sequences, and hence classify new examples. In this paper, we propose a novel multi-granularity sequence classification framework that generates features based on frequent patterns at multiple levels of time granularity. Feature selection techniques are applied to identify the most informative features that are then used to construct the classification model. We show the applicability and suitability of the proposed framework to the area of educational data mining by experimenting on an educational dataset collected from an asynchronous communication tool in which students interact to accomplish an underlying group project. The experimental results showed that our model can achieve competitive performance in detecting the students' roles in their corresponding projects, compared to a baseline similarity-based approach.
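The core idea, generating classification features from frequent patterns at several time granularities, can be sketched minimally as follows. The event symbols, bucket sizes, and the n-gram notion of "pattern" are illustrative simplifications, not the authors' exact method:

```python
from collections import Counter

def granularity_buckets(events, granularity):
    # events: list of (timestamp_in_seconds, symbol); group symbols into
    # buckets of `granularity` seconds, keeping buckets in time order.
    buckets = {}
    for t, sym in events:
        buckets.setdefault(t // granularity, []).append(sym)
    return [tuple(b) for _, b in sorted(buckets.items())]

def pattern_features(events, granularities=(3600, 86400), n=2):
    # Count n-grams over per-bucket event multisets at each granularity;
    # the counts become one feature vector per sequence.
    feats = Counter()
    for g in granularities:
        seq = [''.join(sorted(b)) for b in granularity_buckets(events, g)]
        for i in range(len(seq) - n + 1):
            feats[(g,) + tuple(seq[i:i + n])] += 1
    return feats

# Hypothetical student-interaction trace: (seconds since start, action).
events = [(0, 'post'), (1800, 'reply'), (4000, 'post'), (90000, 'reply')]
fv = pattern_features(events)
```

In a full pipeline, such feature counts would be vectorized and passed through feature selection before training the classifier, as the abstract describes.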
{"title":"A Multi-Granularity Pattern-Based Sequence Classification Framework for Educational Data","authors":"Mohammad Jaber, P. Wood, P. Papapetrou, A. González‐Marcos","doi":"10.1109/DSAA.2016.46","DOIUrl":"https://doi.org/10.1109/DSAA.2016.46","url":null,"abstract":"In many application domains, such as education, sequences of events occurring over time need to be studied in order to understand the generative process behind these sequences, and hence classify new examples. In this paper, we propose a novel multi-granularity sequence classification framework that generates features based on frequent patterns at multiple levels of time granularity. Feature selection techniques are applied to identify the most informative features that are then used to construct the classification model. We show the applicability and suitability of the proposed framework to the area of educational data mining by experimenting on an educational dataset collected from an asynchronous communication tool in which students interact to accomplish an underlying group project. The experimental results showed that our model can achieve competitive performance in detecting the students' roles in their corresponding projects, compared to a baseline similarity-based approach.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128405209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
Task Composition in Crowdsourcing
S. Amer-Yahia, Éric Gaussier, V. Leroy, Julien Pilourdault, R. M. Borromeo, Motomichi Toyama
Crowdsourcing has gained popularity in a variety of domains as an increasing number of jobs are "taskified" and completed independently by a set of workers. A central process in crowdsourcing is the mechanism through which workers find tasks. On popular platforms such as Amazon Mechanical Turk, tasks can be sorted by dimensions such as creation date or reward amount. Research efforts on task assignment have focused on adopting a requester-centric approach whereby tasks are proposed to workers in order to maximize overall task throughput, result quality and cost. In this paper, we advocate the need to complement that with a worker-centric approach to task assignment, and examine the problem of producing, for each worker, a personalized summary of tasks that preserves overall task throughput. We formalize task composition for workers as an optimization problem that finds a representative set of k valid and relevant Composite Tasks (CTs). Validity enforces that a composite task complies with the task arrival rate and satisfies the worker's expected wage. Relevance imposes that tasks match the worker's qualifications. We show empirically that workers' experience is greatly improved due to task homogeneity in each CT and to the match between CTs and workers' skills. As a result, task throughput is improved.
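The validity and relevance constraints on composite tasks can be illustrated with a greedy sketch. The task tuple layout, wage threshold, and greedy selection are assumptions for illustration; the paper solves this as a formal optimization problem:

```python
def compose_tasks(tasks, k, group_size, min_wage, skills):
    # Greedy sketch: bucket tasks by type (homogeneity), keep only those
    # matching the worker's skills (relevance), and emit up to k composite
    # tasks whose total reward meets the expected wage (validity).
    by_type = {}
    for t in tasks:  # t: (task_type, reward, required_skill)
        if t[2] in skills:
            by_type.setdefault(t[0], []).append(t)
    cts = []
    # Prefer task types with the most available tasks.
    for ttype, ts in sorted(by_type.items(), key=lambda kv: -len(kv[1])):
        ct = ts[:group_size]
        if sum(t[1] for t in ct) >= min_wage:
            cts.append(ct)
        if len(cts) == k:
            break
    return cts

# Hypothetical task pool for a worker with only the 'en' skill.
cts = compose_tasks(
    [('label', 0.5, 'en'), ('label', 0.6, 'en'),
     ('translate', 2.0, 'fr'), ('label', 0.4, 'en')],
    k=1, group_size=2, min_wage=1.0, skills={'en'})
```

Each returned CT is homogeneous by construction, which is the property the authors link to improved worker experience.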
{"title":"Task Composition in Crowdsourcing","authors":"S. Amer-Yahia, Éric Gaussier, V. Leroy, Julien Pilourdault, R. M. Borromeo, Motomichi Toyama","doi":"10.1109/DSAA.2016.27","DOIUrl":"https://doi.org/10.1109/DSAA.2016.27","url":null,"abstract":"Crowdsourcing has gained popularity in a variety of domains as an increasing number of jobs are \"taskified\" and completed independently by a set of workers. A central process in crowdsourcing is the mechanism through which workers find tasks. On popular platforms such as Amazon Mechanical Turk, tasks can be sorted by dimensions such as creation date or reward amount. Research efforts on task assignment have focused on adopting a requester-centric approach whereby tasks are proposed to workers in order to maximize overall task throughput, result quality and cost. In this paper, we advocate the need to complement that with a worker-centric approach to task assignment, and examine the problem of producing, for each worker, a personalized summary of tasks that preserves overall task throughput. We formalize task composition for workers as an optimization problem that finds a representative set of k valid and relevant Composite Tasks (CTs). Validity enforces that a composite task complies with the task arrival rate and satisfies the worker's expected wage. Relevance imposes that tasks match the worker's qualifications. We show empirically that workers' experience is greatly improved due to task homogeneity in each CT and to the adequation of CTs with workers' skills. 
As a result task throughput is improved.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131729438","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 16
Maritime Pattern Extraction from AIS Data Using a Genetic Algorithm
Andrej Dobrkovic, M. Iacob, J. Hillegersberg
The long-term prediction of maritime vessels' destinations and arrival times is essential for effective logistics planning. As ships are influenced by various factors over a long period of time, the solution cannot be achieved by analyzing the sailing patterns of each entity separately. Instead, an approach is required that can extract maritime patterns for the area in question and represent them in a form suitable for querying all possible routes any vessel in that region can take. To tackle this problem we use a genetic algorithm (GA) to cluster vessel position data obtained from the publicly available Automatic Identification System (AIS). The resulting clusters are treated as route waypoints (WPs), and by connecting them we obtain the nodes and edges of a directed graph depicting maritime patterns. Since standard clustering algorithms have difficulty handling data with varying density, and genetic algorithms are slow when handling large data volumes, in this paper we investigate how to enhance the genetic algorithm to allow fast and accurate waypoint identification. We also include a quadtree structure to preprocess the data and reduce the input for the GA. Once the route graph is created, we add post-processing to remove inconsistencies caused by noise in the AIS data. Finally, we validate the results produced by the GA by comparing the resulting patterns with known inland water routes for two Dutch provinces.
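The quadtree preprocessing step, collapsing dense AIS point clouds into a few representative points before the GA runs, can be sketched as follows. Coordinates, capacities, and the centroid summarization are invented for illustration; the GA itself is omitted:

```python
def quadtree_reduce(points, bounds, capacity=4, depth=0, max_depth=8):
    # Recursively split the bounding box until each leaf holds at most
    # `capacity` points, then return one (x, y, count) centroid per
    # non-empty leaf. This shrinks the GA's input dramatically in
    # dense areas while keeping sparse areas intact.
    if len(points) <= capacity or depth == max_depth:
        if not points:
            return []
        cx = sum(p[0] for p in points) / len(points)
        cy = sum(p[1] for p in points) / len(points)
        return [(cx, cy, len(points))]
    x0, y0, x1, y1 = bounds
    mx, my = (x0 + x1) / 2, (y0 + y1) / 2
    quads = {(False, False): [], (False, True): [],
             (True, False): [], (True, True): []}
    for p in points:
        quads[(p[0] >= mx, p[1] >= my)].append(p)
    sub = {(False, False): (x0, y0, mx, my), (False, True): (x0, my, mx, y1),
           (True, False): (mx, y0, x1, my), (True, True): (mx, my, x1, y1)}
    out = []
    for key, pts in quads.items():
        out.extend(quadtree_reduce(pts, sub[key], capacity, depth + 1, max_depth))
    return out

# Two synthetic AIS clusters in opposite corners of a unit box.
pts = [(0.10, 0.10), (0.12, 0.11), (0.90, 0.90), (0.88, 0.92)]
cents = quadtree_reduce(pts, (0.0, 0.0, 1.0, 1.0), capacity=2)
```

Each centroid then acts as a candidate waypoint for the clustering GA.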
{"title":"Maritime Pattern Extraction from AIS Data Using a Genetic Algorithm","authors":"Andrej Dobrkovic, M. Iacob, J. Hillegersberg","doi":"10.1109/DSAA.2016.73","DOIUrl":"https://doi.org/10.1109/DSAA.2016.73","url":null,"abstract":"The long term prediction of maritime vessels' destinations and arrival times is essential for making an effective logistics planning. As ships are influenced by various factors over a long period of time, the solution cannot be achieved by analyzing sailing patterns of each entity separately. Instead, an approach is required, that can extract maritime patterns for the area in question and represent it in a form suitable for querying all possible routes any vessel in that region can take. To tackle this problem we use a genetic algorithm (GA) to cluster vessel position data obtained from the publicly available Automatic Identification System (AIS). The resulting clusters are treated as route waypoints (WP), and by connecting them we get nodes and edges of a directed graph depicting maritime patterns. Since standard clustering algorithms have difficulties in handling data with varying density, and genetic algorithms are slow when handling large data volumes, in this paper we investigate how to enhance the genetic algorithm to allow fast and accurate waypoint identification. We also include a quad tree structure to preprocess data and reduce the input for the GA. When the route graph is created, we add post processing to remove inconsistencies caused by noise in the AIS data. 
Finally, we validate the results produced by the GA by comparing resulting patterns with known inland water routes for two Dutch provinces.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"829 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116422551","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 14
What Did I Do Wrong in My MOBA Game? Mining Patterns Discriminating Deviant Behaviours
Olivier Cavadenti, Víctor Codocedo, Jean-François Boulicaut, Mehdi Kaytoue-Uberall
The success of electronic sports (eSports), where professional gamers participate in competitive leagues and tournaments, brings new challenges for the video game industry. Besides being fun, games must be difficult and challenging for eSports professionals but still easy and enjoyable for amateurs. In this article, we consider Multi-player Online Battle Arena games (MOBA) and particularly "Defense of the Ancients 2", commonly known simply as DOTA2. In this context, a challenge is to propose data analysis methods and metrics that help players to improve their skills. We design a data mining-based method that discovers strategic patterns from historical behavioral traces: given a model encoding an expected way of playing (the norm), we are interested in patterns deviating from the norm that may explain a game outcome, from which players can learn more efficient ways of playing. The method is formally introduced and shown to be adaptable to different scenarios. Finally, we provide an experimental evaluation over a dataset of 10,000 behavioral game traces.
{"title":"What Did I Do Wrong in My MOBA Game? Mining Patterns Discriminating Deviant Behaviours","authors":"Olivier Cavadenti, Víctor Codocedo, Jean-François Boulicaut, Mehdi Kaytoue-Uberall","doi":"10.1109/DSAA.2016.75","DOIUrl":"https://doi.org/10.1109/DSAA.2016.75","url":null,"abstract":"The success of electronic sports (eSports), where professional gamers participate in competitive leagues and tournaments, brings new challenges for the video game industry. Other than fun, games must be difficult and challenging for eSports professionals but still easy and enjoyable for amateurs. In this article, we consider Multi-player Online Battle Arena games (MOBA) and particularly, \"Defense of the Ancients 2\", commonly known simply as DOTA2. In this context, a challenge is to propose data analysis methods and metrics that help players to improve their skills. We design a data mining-based method that discovers strategic patterns from historical behavioral traces: Given a model encoding an expected way of playing (the norm), we are interested in patterns deviating from the norm that may explain a game outcome from which player can learn more efficient ways of playing. The method is formally introduced and shown to be adaptable to different scenarios. Finally, we provide an experimental evaluation over a dataset of 10 000 behavioral game traces.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125772749","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 32
Advanced Analytics for Train Delay Prediction Systems by Including Exogenous Weather Data
L. Oneto, Emanuele Fumeo, Giorgio Clerico, Renzo Canepa, Federico Papa, C. Dambra, N. Mazzino, D. Anguita
State-of-the-art train delay prediction systems neither exploit historical data about train movements, nor exogenous data about phenomena that can affect railway operations. They rely, instead, on static rules built by experts of the railway infrastructure based on classical univariate statistics. The purpose of this paper is to build a data-driven train delay prediction system that exploits the most recent analytics tools. The train delay prediction problem has been mapped into a multivariate regression problem and the performance of kernel methods, ensemble methods and feed-forward neural networks have been compared. Firstly, it is shown that it is possible to build a reliable and robust data-driven model based only on the historical data about the train movements. Additionally, the model can be further improved by including data coming from exogenous sources, in particular the weather information provided by national weather services. Results on real world data coming from the Italian railway network show that the proposal of this paper is able to remarkably improve the current state-of-the-art train delay prediction systems. Moreover, the performed simulations show that the inclusion of weather data into the model has a significant positive impact on its performance.
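The mapping to a multivariate regression problem, with and without exogenous weather features, can be illustrated on synthetic data. Here ordinary least squares stands in for the paper's kernel, ensemble, and neural regressors, and all features and coefficients are invented:

```python
import numpy as np

def fit_predict(Xtr, ytr, Xte):
    # Ordinary least squares with an intercept column, as a stand-in
    # for the regressors compared in the paper.
    Xtr1 = np.hstack([Xtr, np.ones((len(Xtr), 1))])
    Xte1 = np.hstack([Xte, np.ones((len(Xte), 1))])
    w, *_ = np.linalg.lstsq(Xtr1, ytr, rcond=None)
    return Xte1 @ w

rng = np.random.default_rng(0)
n = 600
X_move = rng.normal(size=(n, 3))     # hypothetical movement features, e.g. previous delay
X_weather = rng.normal(size=(n, 2))  # hypothetical exogenous rainfall / wind features
# Synthetic delay: partly explained by movement, partly by weather.
y = 2.0 * X_move[:, 1] + 1.5 * X_weather[:, 0] + rng.normal(scale=0.3, size=n)

X_full = np.hstack([X_move, X_weather])
tr, te = np.arange(400), np.arange(400, n)
err_base = np.mean((fit_predict(X_move[tr], y[tr], X_move[te]) - y[te]) ** 2)
err_full = np.mean((fit_predict(X_full[tr], y[tr], X_full[te]) - y[te]) ** 2)
```

On this toy data the weather-augmented model has much lower held-out error, mirroring the paper's finding that exogenous weather data improves delay prediction.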
{"title":"Advanced Analytics for Train Delay Prediction Systems by Including Exogenous Weather Data","authors":"L. Oneto, Emanuele Fumeo, Giorgio Clerico, Renzo Canepa, Federico Papa, C. Dambra, N. Mazzino, D. Anguita","doi":"10.1109/DSAA.2016.57","DOIUrl":"https://doi.org/10.1109/DSAA.2016.57","url":null,"abstract":"State-of-the-art train delay prediction systems neither exploit historical data about train movements, nor exogenous data about phenomena that can affect railway operations. They rely, instead, on static rules built by experts of the railway infrastructure based on classical univariate statistics. The purpose of this paper is to build a data-driven train delay prediction system that exploits the most recent analytics tools. The train delay prediction problem has been mapped into a multivariate regression problem and the performance of kernel methods, ensemble methods and feed-forward neural networks have been compared. Firstly, it is shown that it is possible to build a reliable and robust data-driven model based only on the historical data about the train movements. Additionally, the model can be further improved by including data coming from exogenous sources, in particular the weather information provided by national weather services. Results on real world data coming from the Italian railway network show that the proposal of this paper is able to remarkably improve the current state-of-the-art train delay prediction systems. 
Moreover, the performed simulations show that the inclusion of weather data into the model has a significant positive impact on its performance.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115225704","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 31
Web Behavior Analysis Using Sparse Non-Negative Matrix Factorization
Akihiro Demachi, Shin Matsushima, K. Yamanishi
We are concerned with the issue of discovering behavioral patterns on the web. When a large amount of web access logs is given, we are interested in how they are categorized and how they are related to activities in real life. In order to conduct that analysis, we develop a novel algorithm for sparse non-negative matrix factorization (SNMF), which can discover patterns of web behaviors. Although there exist a number of variants of SNMF, our algorithm is novel in that it updates parameters in a multiplicative way with guaranteed performance, and thereby works more robustly than existing ones, even when the rank of the factorized matrices is large. We demonstrate the effectiveness of our algorithm using artificial data sets. We then apply our algorithm to large-scale web log data obtained from 70,000 monitors to discover meaningful relations among web behavioral patterns and real-life activities. We employ an information-theoretic measure to demonstrate that our algorithm is able to extract more significant relations between web behavior patterns and real-life activities than competing methods.
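A standard multiplicative-update SNMF with an L1 penalty on the activation matrix illustrates the family of algorithms the abstract refers to. This is a textbook variant, not the authors' performance-guaranteed update rule:

```python
import numpy as np

def sparse_nmf(V, rank, sparsity=0.1, iters=200, eps=1e-9):
    # Factorize non-negative V (n x m) as W (n x rank) @ H (rank x m).
    # Multiplicative updates keep W and H non-negative; the `sparsity`
    # term added to H's denominator acts as an L1 penalty that pushes
    # small activations toward zero.
    rng = np.random.default_rng(0)
    n, m = V.shape
    W = rng.random((n, rank))
    H = rng.random((rank, m))
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + sparsity + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Synthetic non-negative "access log" matrix: 20 users x 30 page types.
V = np.random.default_rng(1).random((20, 30))
W, H = sparse_nmf(V, rank=5)
```

Rows of H would then be interpreted as web behavior patterns and rows of W as per-user pattern activations.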
{"title":"Web Behavior Analysis Using Sparse Non-Negative Matrix Factorization","authors":"Akihiro Demachi, Shin Matsushima, K. Yamanishi","doi":"10.1109/DSAA.2016.85","DOIUrl":"https://doi.org/10.1109/DSAA.2016.85","url":null,"abstract":"We are concerned with the issue of discovering behavioral patterns on the web. When a large amount of web access logs are given, we are interested in how they are categorized and how they are related to activities in real life. In order to conduct that analysis, we develop a novel algorithm for sparse non-negative matrix factorization (SNMF), which can discover patterns of web behaviors. Although there exist a number of variants of SNMFs, our algorithm is novel in that it updates parameters in a multiplicative way with performance guaranteed, thereby works more robustly than existing ones, even when the rank of factorized matrices is large. We demonstrate the effectiveness of our algorithm using artificial data sets. We then apply our algorithm into a large scale web log data obtained from 70,000 monitors to discover meaningful relations among web behavioral patterns and real life activities. We employ the information-theoretic measure to demonstrate that our algorithm is able to extract more significant relations among web behavior patterns and real life activities than competitive methods.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114908371","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Anonymizing NYC Taxi Data: Does It Matter?
Marie Douriez, Harish Doraiswamy, J. Freire, Cláudio T. Silva
The widespread use of location-based services has led to an increasing availability of trajectory data from urban environments. These data carry rich information that is useful for improving cities through traffic management and city planning. Yet, they also contain information about individuals which can jeopardize their privacy. In this study, we work with the New York City (NYC) taxi trips data set publicly released by the Taxi and Limousine Commission (TLC). This data set contains information about every taxi cab ride that happened in NYC. A bad hashing of the medallion numbers (the ID corresponding to a taxi) allowed the recovery of all the medallion numbers and led to a privacy breach for the drivers, whose income could be easily extracted. In this work, we initiate a study to evaluate whether "perfect" anonymity is possible and whether such an identity disclosure can be avoided, given the availability of diverse external data sets through which the hidden information can be recovered. This is accomplished through a spatio-temporal join based attack which matches the taxi data with external medallion data that can be easily gathered by an adversary. Using a simulation of the medallion data, we show that our attack can re-identify over 91% of the taxis that ply in NYC even when using a perfect pseudonymization of medallion numbers. We also explore the effectiveness of trajectory anonymization strategies and demonstrate that our attack can still identify a significant fraction of the taxis in NYC. Given the restrictions on publishing the taxi data by TLC, our results indicate that unless the utility of the data set is significantly compromised, it will not be possible to maintain the privacy of taxi medallion owners and drivers.
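A toy version of a spatio-temporal join attack: pseudonymized trips are linked to externally observed medallions whenever pickup time and location agree within coarse tolerances. Tolerances, tuple layouts, and the sample records are hypothetical:

```python
def join_attack(trips, sightings, time_tol=60, dist_tol=0.001):
    # trips: (pseudonym, t_seconds, lat, lon) from the anonymized release;
    # sightings: (medallion, t_seconds, lat, lon) gathered externally.
    # Collect every medallion compatible with each pseudonymous pickup.
    matches = {}
    for pseud, t, lat, lon in trips:
        for med, ts, lats, lons in sightings:
            if (abs(t - ts) <= time_tol
                    and abs(lat - lats) <= dist_tol
                    and abs(lon - lons) <= dist_tol):
                matches.setdefault(pseud, set()).add(med)
    # A pseudonym is re-identified when exactly one medallion matches.
    return {p: meds.pop() for p, meds in matches.items() if len(meds) == 1}

trips = [('a', 100, 40.7500, -73.9900)]
sightings = [('M1', 130, 40.7501, -73.9899),   # close in time and space
             ('M2', 5000, 40.7500, -73.9900)]  # same spot, wrong time
ids = join_attack(trips, sightings)
```

With real data the join runs over millions of records and spatial indexing, but the matching principle is the same.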
{"title":"Anonymizing NYC Taxi Data: Does It Matter?","authors":"Marie Douriez, Harish Doraiswamy, J. Freire, Cláudio T. Silva","doi":"10.1109/DSAA.2016.21","DOIUrl":"https://doi.org/10.1109/DSAA.2016.21","url":null,"abstract":"The widespread use of location-based services has led to an increasing availability of trajectory data from urban environments. These data carry rich information that are useful for improving cities through traffic management and city planning. Yet, it also contains information about individuals which can jeopardize their privacy. In this study, we work with the New York City (NYC) taxi trips data set publicly released by the Taxi and Limousine Commission (TLC). This data set contains information about every taxi cab ride that happened in NYC. A bad hashing of the medallion numbers (the ID corresponding to a taxi) allowed the recovery of all the medallion numbers and led to a privacy breach for the drivers, whose income could be easily extracted. In this work, we initiate a study to evaluate whether \"perfect\" anonymity is possible and if such an identity disclosure can be avoided given the availability of diverse sets of external data sets through which the hidden information can be recovered. This is accomplished through a spatio-temporal join based attack which matches the taxi data with an external medallion data that can be easily gathered by an adversary. Using a simulation of the medallion data, we show that our attack can re-identify over 91% of the taxis that ply in NYC even when using a perfect pseudonymization of medallion numbers. We also explore the effectiveness of trajectory anonymization strategies and demonstrate that our attack can still identify a significant fraction of the taxis in NYC. 
Given the restrictions in publishing the taxi data by TLC, our results indicate that unless the utility of the data set is significantly compromised, it will not be possible to maintain the privacy of taxi medallion owners and drivers.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122650308","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 57
Mining Pre-Exposure Prophylaxis Trends in Social Media
P. Breen, Jane M Kelly, T. Heckman, Shannon P. Quinn
Pre-Exposure Prophylaxis (PrEP) is a ground-breaking biomedical approach to curbing the transmission of Human Immunodeficiency Virus (HIV). Truvada, the most common form of PrEP, is a combination of tenofovir and emtricitabine and is a once-daily oral medication taken by HIV-seronegative persons at elevated risk for HIV infection. When taken reliably every day, PrEP can reduce one's risk for HIV infection by as much as 99%. While highly efficacious, PrEP is expensive, somewhat stigmatized, and many health care providers remain uninformed about its benefits. Data mining of social media can monitor the spread of HIV in the United States, but no study has investigated PrEP use and sentiment via social media. This paper describes a data mining and machine learning strategy using natural language processing (NLP) that monitors Twitter social media data to identify PrEP discussion trends. Results showed that we can identify PrEP and HIV discussion dynamics over time, and assign PrEP-related tweets positive or negative sentiment. Results can enable public health professionals to monitor PrEP discussion trends and identify strategies to improve HIV prevention via PrEP.
{"title":"Mining Pre-Exposure Prophylaxis Trends in Social Media","authors":"P. Breen, Jane M Kelly, T. Heckman, Shannon P. Quinn","doi":"10.1109/DSAA.2016.29","DOIUrl":"https://doi.org/10.1109/DSAA.2016.29","url":null,"abstract":"Pre-Exposure Prophylaxis (PrEP) is a ground-breaking biomedical approach to curbing the transmission of Human Immunodeficiency Virus (HIV). Truvada, the most common form of PrEP, is a combination of tenofovir and emtricitabine and is a once-daily oral mediation taken by HIV-seronegative persons at elevated risk for HIV infection. When taken reliably every day, PrEP can reduce one's risk for HIV infection by as much as 99%. While highly efficacious, PrEP is expensive, somewhat stigmatized, and many health care providers remain uninformed about its benefits. Data mining of social media can monitor the spread of HIV in the United States, but no study has investigated PrEP use and sentiment via social media. This paper describes a data mining and machine learning strategy using natural language processing (NLP) that monitors Twitter social media data to identify PrEP discussion trends. Results showed that we can identify PrEP and HIV discussion dynamics over time, and assign PrEP-related tweets positive or negative sentiment. 
Results can enable public health professionals to monitor PrEP discussion trends and identify strategies to improve HIV prevention via PrEP.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"142 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128595396","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
Churn Prediction in Mobile Social Games: Towards a Complete Assessment Using Survival Ensembles
Á. Periáñez, A. Saas, Anna Guitart, Colin Magne
Reducing user attrition, i.e. churn, is a broad challenge faced by several industries. In mobile social games, decreasing churn is decisive for increasing player retention and raising revenues. Churn prediction models allow us to understand player loyalty and to anticipate when players will stop playing a game. Thanks to these predictions, several initiatives can be taken to retain those players who are more likely to churn. Survival analysis focuses on predicting the time of occurrence of a certain event, churn in our case. Classical methods, like regressions, could be applied only when all players have left the game. The challenge arises for datasets with incomplete churn information, as most players still connect to the game. This is called a censored data problem and is inherent in the nature of churn. Censoring is commonly handled with survival analysis techniques, but due to the inflexibility of survival statistical algorithms, the accuracy achieved is often poor. In contrast, novel ensemble learning techniques, increasingly popular in a variety of scientific fields, provide high-quality prediction results. In this work, we develop, for the first time in the social games domain, a survival ensemble model which provides a comprehensive analysis together with an accurate prediction of churn. For each player, we predict the probability of churning as a function of time, which permits us to distinguish various levels of loyalty profiles. Additionally, we assess the risk factors that explain the predicted player survival times. Our results show that churn prediction by survival ensembles significantly improves the accuracy and robustness of traditional analyses, like Cox regression.
{"title":"Churn Prediction in Mobile Social Games: Towards a Complete Assessment Using Survival Ensembles","authors":"Á. Periáñez, A. Saas, Anna Guitart, Colin Magne","doi":"10.1109/DSAA.2016.84","DOIUrl":"https://doi.org/10.1109/DSAA.2016.84","url":null,"abstract":"Reducing user attrition, i.e. churn, is a broad challenge faced by several industries. In mobile social games, decreasing churn is decisive to increase player retention and rise revenues. Churn prediction models allow to understand player loyalty and to anticipate when they will stop playing a game. Thanks to these predictions, several initiatives can be taken to retain those players who are more likely to churn. Survival analysis focuses on predicting the time of occurrence of a certain event, churn in our case. Classical methods, like regressions, could be applied only when all players have left the game. The challenge arises for datasets with incomplete churning information for all players, as most of them still connect to the game. This is called a censored data problem and is in the nature of churn. Censoring is commonly dealt with survival analysis techniques, but due to the inflexibility of the survival statistical algorithms, the accuracy achieved is often poor. In contrast, novel ensemble learning techniques, increasingly popular in a variety of scientific fields, provide high-class prediction results. In this work, we develop, for the first time in the social games domain, a survival ensemble model which provides a comprehensive analysis together with an accurate prediction of churn. For each player, we predict the probability of churning as function of time, which permits to distinguish various levels of loyalty profiles. Additionally, we assess the risk factors that explain the predicted player survival times. Our results show that churn prediction by survival ensembles significantly improves the accuracy and robustness of traditional analyses, like Cox regression.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":" 14","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120828419","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
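The censored-data problem this abstract describes — most players are still active, so their churn time is unknown — is exactly what classical survival estimators are built for. As a minimal, self-contained illustration (a textbook Kaplan-Meier estimator in plain Python, not the paper's survival ensembles; the sample times are invented), censored players stay in the risk set without ever triggering a drop in the survival curve:

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival curve estimate.

    times  -- observed time per player: churn time if churned,
              last-seen time if still active (censored)
    events -- 1 if the player churned at times[i], 0 if censored
    Returns a list of (event_time, survival_probability) steps.
    """
    # Sort by time; on ties, process churn events before censorings
    order = sorted(range(len(times)), key=lambda i: (times[i], 1 - events[i]))
    at_risk = len(times)
    surv, curve = 1.0, []
    for i in order:
        if events[i]:                 # churn observed: curve steps down
            surv *= (at_risk - 1) / at_risk
            curve.append((times[i], surv))
        at_risk -= 1                  # censored players only leave the risk set
    return curve

# Days played, and whether churn was actually observed (0 = still playing)
curve = kaplan_meier([2, 3, 4, 5, 6], [1, 0, 1, 1, 0])
print(curve)  # [(2, 0.8), (4, 0.533...), (5, 0.266...)]
```

Dropping the censored players instead (as a plain regression on churners would) systematically biases survival downward, which is the problem the paper's ensembles address with far richer models.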
Citations: 79
Combining Static and Dynamic Features for Multivariate Sequence Classification
A. Leontjeva, Ilya Kuzovkin
Model precision in a classification task is highly dependent on the feature space that is used to train the model. Moreover, whether the features are sequential or static will dictate which classification method can be applied, as most machine learning algorithms are designed to deal with either one type of data or the other. In real-life scenarios, however, it is often the case that both static and dynamic features are present, or can be extracted from the data. In this work, we demonstrate how generative models such as Hidden Markov Models (HMM) and Long Short-Term Memory (LSTM) artificial neural networks can be used to extract temporal information from the dynamic data. We explore how the extracted information can be combined with the static features in order to improve the classification performance. We evaluate the existing techniques and suggest a hybrid approach, which outperforms other methods on several public datasets.
{"title":"Combining Static and Dynamic Features for Multivariate Sequence Classification","authors":"A. Leontjeva, Ilya Kuzovkin","doi":"10.1109/DSAA.2016.10","DOIUrl":"https://doi.org/10.1109/DSAA.2016.10","url":null,"abstract":"Model precision in a classification task is highly dependent on the feature space that is used to train the model. Moreover, whether the features are sequential or static will dictate which classification method can be applied as most of the machine learning algorithms are designed to deal with either one or another type of data. In real-life scenarios, however, it is often the case that both static and dynamic features are present, or can be extracted from the data. In this work, we demonstrate how generative models such as Hidden Markov Models (HMM) and Long Short-Term Memory (LSTM) artificial neural networks can be used to extract temporal information from the dynamic data. We explore how the extracted information can be combined with the static features in order to improve the classification performance. We evaluate the existing techniques and suggest a hybrid approach, which outperforms other methods on several public datasets.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132767583","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
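The hybrid idea in this abstract — derive features from a generative sequence model, then concatenate them with the static features for an ordinary classifier — can be sketched with a much simpler generative model than an HMM or LSTM. Below, a first-order Markov chain is fit per class and each sequence's log-likelihood under each class model becomes a dynamic feature (a minimal illustration with invented event names, not the paper's method):

```python
import math
from collections import defaultdict

def fit_markov(seqs, smoothing=1.0):
    """Fit a first-order Markov model; return a log-likelihood scorer."""
    counts = defaultdict(lambda: defaultdict(float))
    states = set()
    for seq in seqs:
        states.update(seq)
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1

    def logprob(seq):
        lp = 0.0
        for a, b in zip(seq, seq[1:]):
            total = sum(counts[a].values())
            # Add-one style smoothing keeps unseen transitions finite
            lp += math.log((counts[a][b] + smoothing)
                           / (total + smoothing * len(states)))
        return lp

    return logprob

# Invented event sequences for two classes of users
churners = [["login", "play", "quit"], ["login", "quit"]]
loyal    = [["login", "play", "play", "login"], ["play", "play", "login"]]
lp_churn, lp_loyal = fit_markov(churners), fit_markov(loyal)

def feature_vector(static_features, seq):
    # Dynamic features (one log-likelihood per class model) are simply
    # appended to the static ones; any static classifier can take it from here
    return static_features + [lp_churn(seq), lp_loyal(seq)]

print(feature_vector([34.0, 1], ["login", "quit"]))
```

The paper's HMM/LSTM extractors play the same role as `fit_markov` here: they turn a variable-length sequence into a fixed number of real-valued features that a static classifier can consume.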
Citations: 23