Taxis provide a flexible and indispensable service to satisfy the urban travel demand of public commuters. Understanding taxi supply and commuter demand, especially the imbalance between the supply and the demand, would directly help to improve the quality of taxi service and eventually increase a city's traffic system efficiency. In this paper, we consider the taxi demand from a region during a period of time to include two parts: satisfied demand, i.e., passengers successfully receive taxi service during this period of time, and unmet demand, i.e., passengers are still waiting for taxi service. To properly estimate the demand-supply level (short for "the level of the taxi demand vs. supply imbalance"), we propose a novel indicator that reflects how fast an available taxi is taken in any given region. Accordingly, we design and implement a taxi analytics system to provide such information in near real time. Finally, we use the passenger waiting time survey data and the taxi streaming data to validate the proposed indicator on the built taxi analytics system.
{"title":"Estimating Taxi Demand-Supply Level Using Taxi Trajectory Data Stream","authors":"Dongxu Shao, Wei Wu, Shili Xiang, Yu Lu","doi":"10.1109/ICDMW.2015.250","DOIUrl":"https://doi.org/10.1109/ICDMW.2015.250","url":null,"abstract":"Taxis provide a flexible and indispensable service to satisfy the urban travel demand of public commuters. Understanding taxi supply and commuter demand, especially the imbalance between the supply and the demand, would directly help to improve the quality of taxi service and eventually increase a city's traffic system efficiency. In this paper, we consider the taxi demand from a region during a period of time to include two parts: satisfied demand, i.e., passengers successfully receive taxi service during this period of time, and unmet demand, i.e., passengers are still waiting for taxi service. To properly estimate the demand-supply level (short for \"the level of the taxi demand vs. supply imbalance\"), we propose a novel indicator that reflects how fast an available taxi is taken in any given region. Accordingly, we design and implement a taxi analytics system to provide such information in near real time. Finally, we use the passenger waiting time survey data and the taxi streaming data to validate the proposed indicator on the built taxi analytics system.","PeriodicalId":192888,"journal":{"name":"2015 IEEE International Conference on Data Mining Workshop (ICDMW)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134154136","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
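The taxi abstract above does not define its indicator, so the following is only a minimal sketch of one way "how fast an available taxi is taken" could be measured: the mean time a taxi stays vacant in a region before it is hired. The event format, the `FREE`/`HIRED` status labels, the region names, and the function name `mean_vacant_seconds` are all assumptions made for illustration, not the authors' method.

```python
from collections import defaultdict

def mean_vacant_seconds(events):
    """Mean vacant time per region (hypothetical indicator).

    events: iterable of (taxi_id, region, timestamp_seconds, status),
    sorted by time. status is "FREE" when the taxi becomes available
    and "HIRED" when a passenger takes it.
    """
    free_since = {}                # taxi_id -> (region, time it became free)
    durations = defaultdict(list)  # region -> vacant durations in seconds
    for taxi_id, region, ts, status in events:
        if status == "FREE":
            free_since[taxi_id] = (region, ts)
        elif status == "HIRED" and taxi_id in free_since:
            start_region, start_ts = free_since.pop(taxi_id)
            # Attribute the wait to the region where the taxi became available.
            durations[start_region].append(ts - start_ts)
    return {r: sum(d) / len(d) for r, d in durations.items()}

# Toy stream: taxis in region A are taken quickly, the one in B waits long.
events = [
    ("t1", "A", 0, "FREE"), ("t2", "B", 10, "FREE"),
    ("t1", "A", 60, "HIRED"), ("t3", "A", 100, "FREE"),
    ("t3", "A", 120, "HIRED"), ("t2", "B", 610, "HIRED"),
]
print(mean_vacant_seconds(events))  # {'A': 40.0, 'B': 600.0}
```

Under this assumed reading, a short mean vacant time would suggest that demand is high relative to supply in that region, and a long one the opposite.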
In multi-class classification, there is a common type of problem in which each instance is associated with an ordinal label; such problems arise in various settings such as text mining, visual recognition, and other information retrieval tasks. Support vector ordinal regression (SVOR) is a widely used model for ordinal regression. In some applications, such as document classification, data usually appear in a high-dimensional feature space, and linear SVOR becomes a good choice. In this work, we develop an efficient solver for training large-scale linear SVOR based on the alternating direction method of multipliers (ADMM). When compared empirically on benchmark data sets, the proposed solver enjoys advantages in both training speed and generalization performance over the method based on sequential minimal optimization (SMO), which validates the effectiveness and efficiency of our algorithm.
{"title":"Large-Scale Linear Support Vector Ordinal Regression Solver","authors":"Yong Shi, Huadong Wang, Lingfeng Niu","doi":"10.1109/ICDMW.2015.257","DOIUrl":"https://doi.org/10.1109/ICDMW.2015.257","url":null,"abstract":"In multi-class classification, there is a common type of problem in which each instance is associated with an ordinal label; such problems arise in various settings such as text mining, visual recognition, and other information retrieval tasks. Support vector ordinal regression (SVOR) is a widely used model for ordinal regression. In some applications, such as document classification, data usually appear in a high-dimensional feature space, and linear SVOR becomes a good choice. In this work, we develop an efficient solver for training large-scale linear SVOR based on the alternating direction method of multipliers (ADMM). When compared empirically on benchmark data sets, the proposed solver enjoys advantages in both training speed and generalization performance over the method based on sequential minimal optimization (SMO), which validates the effectiveness and efficiency of our algorithm.","PeriodicalId":192888,"journal":{"name":"2015 IEEE International Conference on Data Mining Workshop (ICDMW)","volume":"41 8","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113938582","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
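For readers unfamiliar with ordinal regression, a linear threshold model predicts a rank by comparing the score w·x with an ordered set of thresholds. The sketch below shows only this prediction rule, assuming a weight vector and r-1 thresholds have already been learned. It is not the ADMM solver proposed in the paper, and the names `predict_rank`, `w`, and `thresholds` are illustrative.

```python
from bisect import bisect_right

def predict_rank(w, thresholds, x):
    """Return a rank in 1..len(thresholds)+1 for feature vector x.

    thresholds must be sorted in ascending order (b_1 <= ... <= b_{r-1}).
    """
    score = sum(wi * xi for wi, xi in zip(w, x))
    # The rank is one plus the number of thresholds at or below the score.
    return 1 + bisect_right(thresholds, score)

# Hypothetical learned parameters for a 4-level ordinal label.
w = [1.0, -0.5]
thresholds = [-1.0, 0.0, 2.0]
print(predict_rank(w, thresholds, [0.0, 4.0]))  # score -2.0 -> rank 1
print(predict_rank(w, thresholds, [1.0, 1.0]))  # score  0.5 -> rank 3
print(predict_rank(w, thresholds, [3.0, 0.0]))  # score  3.0 -> rank 4
```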
We present a supervised machine learning system capable of matching internet devices to web cookies through filtering, feature engineering, binary classification, and post-processing. The system builds reasonably sized training and testing data sets through filtering and feature engineering. We build 415 features in total. Some of these features were engineered to be O(n)-time, standalone classifiers for this problem. Other features use various natural language processing (NLP) techniques. Meta-features are created by ridge regression and AdaBoost. Binary classification is then performed with two different gradient boosting models (XGBoost with logarithmic loss). A post-processing pipeline connects devices and cookies in a way that maximizes the F_0.5 score. Our machine learning system obtained a private F_0.5 score of 0.849562, for a final rank of 12th out of 340 on the ICDM 2015: Drawbridge Cross-Device Connections challenge.
{"title":"Connecting Devices to Cookies via Filtering, Feature Engineering, and Boosting","authors":"M. Kim, Jiwei Liu, Xiaozhou Wang, Wei Yang","doi":"10.1109/ICDMW.2015.236","DOIUrl":"https://doi.org/10.1109/ICDMW.2015.236","url":null,"abstract":"We present a supervised machine learning system capable of matching internet devices to web cookies through filtering, feature engineering, binary classification, and post-processing. The system builds reasonably sized training and testing data sets through filtering and feature engineering. We build 415 features in total. Some of these features were engineered to be O(n)-time, standalone classifiers for this problem. Other features use various natural language processing (NLP) techniques. Meta-features are created by ridge regression and AdaBoost. Binary classification is then performed with two different gradient boosting models (XGBoost with logarithmic loss). A post-processing pipeline connects devices and cookies in a way that maximizes the F_0.5 score. Our machine learning system obtained a private F_0.5 score of 0.849562, for a final rank of 12th out of 340 on the ICDM 2015: Drawbridge Cross-Device Connections challenge.","PeriodicalId":192888,"journal":{"name":"2015 IEEE International Conference on Data Mining Workshop (ICDMW)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127899133","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
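The F_0.5 score mentioned in the abstract above is the F_beta measure with beta = 0.5, which weights precision more heavily than recall: F_beta = (1 + beta^2) · P · R / (beta^2 · P + R). The sketch below computes it for a single set of predicted device-cookie pairs. How the challenge aggregated the score across devices is not stated in the abstract, and the function name `f_beta` and the toy pairs are illustrative assumptions.

```python
def f_beta(true_pairs, predicted_pairs, beta=0.5):
    """F_beta score of predicted (device, cookie) pairs against the truth."""
    true_pairs, predicted_pairs = set(true_pairs), set(predicted_pairs)
    tp = len(true_pairs & predicted_pairs)  # correctly predicted pairs
    if tp == 0:
        return 0.0
    precision = tp / len(predicted_pairs)
    recall = tp / len(true_pairs)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# One correct prediction out of three true pairs: precision 1, recall 1/3.
truth = {("d1", "c1"), ("d1", "c2"), ("d2", "c3")}
pred = {("d1", "c1")}
print(round(f_beta(truth, pred), 4))            # 0.7143 (F_0.5)
print(round(f_beta(truth, pred, beta=1.0), 4))  # 0.5    (F_1)
```

The same predictions score higher under F_0.5 than under F_1, which illustrates why a pipeline tuned for F_0.5 would tend to prefer fewer, more confident matches.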
We introduce the concept of Least Similar Nearest Neighbours (LeSiNN) and use LeSiNN to detect anomalies directly. Although there is an existing method which is a special case of LeSiNN, this paper is, as far as we know, the first to clearly articulate the underlying concept. LeSiNN is the first ensemble method which works well with models trained using samples of one instance. LeSiNN has linear time complexity with respect to data size and the number of dimensions, and it is one of the few anomaly detectors which can be applied directly to both numeric and categorical data sets. Our extensive empirical evaluation shows that LeSiNN is either competitive with or better than six state-of-the-art anomaly detectors in terms of detection accuracy and runtime.
{"title":"LeSiNN: Detecting Anomalies by Identifying Least Similar Nearest Neighbours","authors":"Guansong Pang, K. Ting, D. Albrecht","doi":"10.1109/ICDMW.2015.62","DOIUrl":"https://doi.org/10.1109/ICDMW.2015.62","url":null,"abstract":"We introduce the concept of Least Similar Nearest Neighbours (LeSiNN) and use LeSiNN to detect anomalies directly. Although there is an existing method which is a special case of LeSiNN, this paper is, as far as we know, the first to clearly articulate the underlying concept. LeSiNN is the first ensemble method which works well with models trained using samples of one instance. LeSiNN has linear time complexity with respect to data size and the number of dimensions, and it is one of the few anomaly detectors which can be applied directly to both numeric and categorical data sets. Our extensive empirical evaluation shows that LeSiNN is either competitive with or better than six state-of-the-art anomaly detectors in terms of detection accuracy and runtime.","PeriodicalId":192888,"journal":{"name":"2015 IEEE International Conference on Data Mining Workshop (ICDMW)","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121199882","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
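The LeSiNN abstract above does not give the scoring rule. One reading, assumed here, is that each ensemble member is a small random subsample of the data, and an instance's anomaly score is its distance to the nearest neighbour in that subsample, averaged over all members. The sketch below follows that reading for numeric data with Euclidean distance; the function name `lesinn_scores` and its parameter names and defaults are illustrative choices.

```python
import math
import random

def lesinn_scores(data, n_subsamples=50, subsample_size=8, seed=0):
    """Average nearest-neighbour distance to small random subsamples.

    Higher scores indicate instances that are far from (least similar to)
    their nearest neighbours, i.e. likely anomalies. The cost is
    O(len(data) * n_subsamples * subsample_size), linear in the data size.
    """
    rng = random.Random(seed)
    totals = [0.0] * len(data)
    for _ in range(n_subsamples):
        sample = rng.sample(data, subsample_size)
        for i, x in enumerate(data):
            # Distance from x to its nearest neighbour within this subsample.
            totals[i] += min(math.dist(x, s) for s in sample)
    return [t / n_subsamples for t in totals]

# A tight cluster of 20 points plus one far-away point (the last one).
points = [(i * 0.1, j * 0.1) for i in range(5) for j in range(4)] + [(10.0, 10.0)]
scores = lesinn_scores(points)
print(scores.index(max(scores)))  # index of the highest-scoring instance
```

Setting `subsample_size=1` corresponds to the "samples of one instance" setting that the abstract highlights.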
Mobile crowd sensing (MCS) is a promising people-centric sensing paradigm which allows ordinary citizens to contribute sensing data using mobile communication devices. In this paper we study the correlation between users' mobility and their role as contributors in MCS applications. We propose a new trajectory-based approach for task allocation in MCS environments and model participants' spatio-temporal competences by analyzing their mobile traces. By allocating MCS tasks only to participants who are familiar with the target location, we significantly increase the reliability of contributed data and reduce the total communication cost. We introduce a novel metric to estimate participants' competence to conduct MCS tasks and propose a fair ranking approach that allows newcomers to compete with experienced senior contributors. Additionally, we group similar expert contributors and thus open up new possibilities for physical collaboration between them. We evaluate our work using the GeoLife trajectory dataset, and the experimental results show the advantages of our approach.
{"title":"Trajectory-Based Task Allocation for Reliable Mobile Crowd Sensing Systems","authors":"Petar Mrazovic, M. Matskin, Nima Dokoohaki","doi":"10.1109/ICDMW.2015.90","DOIUrl":"https://doi.org/10.1109/ICDMW.2015.90","url":null,"abstract":"Mobile crowd sensing (MCS) is a promising people-centric sensing paradigm which allows ordinary citizens to contribute sensing data using mobile communication devices. In this paper we study the correlation between users' mobility and their role as contributors in MCS applications. We propose a new trajectory-based approach for task allocation in MCS environments and model participants' spatio-temporal competences by analyzing their mobile traces. By allocating MCS tasks only to participants who are familiar with the target location, we significantly increase the reliability of contributed data and reduce the total communication cost. We introduce a novel metric to estimate participants' competence to conduct MCS tasks and propose a fair ranking approach that allows newcomers to compete with experienced senior contributors. Additionally, we group similar expert contributors and thus open up new possibilities for physical collaboration between them. We evaluate our work using the GeoLife trajectory dataset, and the experimental results show the advantages of our approach.","PeriodicalId":192888,"journal":{"name":"2015 IEEE International Conference on Data Mining Workshop (ICDMW)","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128586440","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this work we present a Conversation Classifier based on Multiple Classifiers to detect Life Events on Social Media. On the one hand, conversations can provide more context and help disambiguate life event detection, compared with single posts. On the other hand, the increase in the number of messages and the way they interact with each other within the conversation cannot be trivially modeled by a classifier. To tackle this problem, we focus on creating a set of classifiers from different feature sets and combining their classification outputs to improve accuracy. The experiments show that multiple classifiers are promising for this problem, yielding an increase of about 45% in the F-score.
{"title":"A Multiple Classifier System for Classifying Life Events on Social Media","authors":"P. Cavalin, L. G. Moyano, Pedro P. Miranda","doi":"10.1109/ICDMW.2015.182","DOIUrl":"https://doi.org/10.1109/ICDMW.2015.182","url":null,"abstract":"In this work we present a Conversation Classifier based on Multiple Classifiers to detect Life Events on Social Media. On the one hand, conversations can provide more context and help disambiguate life event detection, compared with single posts. On the other hand, the increase in the number of messages and the way they interact with each other within the conversation cannot be trivially modeled by a classifier. To tackle this problem, we focus on creating a set of classifiers from different feature sets and combining their classification outputs to improve accuracy. The experiments show that multiple classifiers are promising for this problem, yielding an increase of about 45% in the F-score.","PeriodicalId":192888,"journal":{"name":"2015 IEEE International Conference on Data Mining Workshop (ICDMW)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117087498","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
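The abstract above says that outputs of classifiers built on different feature sets are combined, but not which combination rule is used. As a minimal sketch under that caveat, the code below applies one common rule, averaging the class probabilities of all classifiers and picking the label with the highest mean. The function name `combine_by_mean`, the labels, and the probabilities are made up for illustration.

```python
def combine_by_mean(prob_outputs):
    """Combine classifiers by averaging their class probabilities.

    prob_outputs: list of dicts {label: probability}, one per classifier.
    Returns (winning_label, {label: mean probability}).
    """
    labels = set().union(*prob_outputs)
    mean = {
        lab: sum(p.get(lab, 0.0) for p in prob_outputs) / len(prob_outputs)
        for lab in labels
    }
    return max(mean, key=mean.get), mean

# Three hypothetical classifiers, e.g. trained on different feature sets.
outputs = [
    {"birth": 0.6, "none": 0.4},
    {"birth": 0.3, "none": 0.7},
    {"birth": 0.8, "none": 0.2},
]
label, mean = combine_by_mean(outputs)
print(label)  # birth
```

Here the second classifier alone would have predicted `none`, but the averaged output follows the two classifiers that agree.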
Shwetabh Khanduja, Vinod Nair, S. Sundararajan, Ameya Raul, Ajesh Babu Shaj, S. Keerthi
We demonstrate a near real-time service monitoring system for detecting and diagnosing issues from high-dimensional time series data. For detection, we have implemented a learning algorithm that constructs a hierarchy of detectors from data. It is scalable, does not require labelled examples of issues for learning, runs in near real-time, and identifies a subset of counter time series as being relevant for a detected issue. For diagnosis, we provide efficient algorithms as post-detection diagnosis aids to find further relevant counter time series at issue times, a SQL-like query language for writing flexible queries that apply these algorithms on the time series data, and a graphical user interface for visualizing the detection and diagnosis results. Our solution has been deployed in production as an end-to-end system for monitoring Microsoft's internal distributed data storage and computing platform consisting of tens of thousands of machines and currently analyses about 12000 counter time series.
{"title":"Near Real-Time Service Monitoring Using High-Dimensional Time Series","authors":"Shwetabh Khanduja, Vinod Nair, S. Sundararajan, Ameya Raul, Ajesh Babu Shaj, S. Keerthi","doi":"10.1109/ICDMW.2015.254","DOIUrl":"https://doi.org/10.1109/ICDMW.2015.254","url":null,"abstract":"We demonstrate a near real-time service monitoring system for detecting and diagnosing issues from high-dimensional time series data. For detection, we have implemented a learning algorithm that constructs a hierarchy of detectors from data. It is scalable, does not require labelled examples of issues for learning, runs in near real-time, and identifies a subset of counter time series as being relevant for a detected issue. For diagnosis, we provide efficient algorithms as post-detection diagnosis aids to find further relevant counter time series at issue times, a SQL-like query language for writing flexible queries that apply these algorithms on the time series data, and a graphical user interface for visualizing the detection and diagnosis results. Our solution has been deployed in production as an end-to-end system for monitoring Microsoft's internal distributed data storage and computing platform consisting of tens of thousands of machines and currently analyses about 12000 counter time series.","PeriodicalId":192888,"journal":{"name":"2015 IEEE International Conference on Data Mining Workshop (ICDMW)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115357151","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Requirements for a system are often discovered during the negotiation process, at the time when stakeholders of the system are thinking over the premises or backgrounds behind other stakeholders' requirements, rather than at the time when stakeholders are thinking about their own requirements. Disagreements and conflicts between stakeholders are utilized as a driver to discover requirements for the system. In this paper, we propose a support tool for discovering conflicts among stakeholders, called an extended goal graph. We implemented a prototype of the tool and applied the prototype to a requirements meeting to confirm its feasibility for discovering conflicts.
{"title":"Extended Goal Graph: A Support Tool for Discovering Conflicts among Stakeholders and Promoting Requirements Elicitation with Goal Orientation","authors":"N. Kushiro, Takuro Shimizu","doi":"10.1109/ICDMW.2015.52","DOIUrl":"https://doi.org/10.1109/ICDMW.2015.52","url":null,"abstract":"Requirements for a system are often discovered during the negotiation process, at the time when stakeholders of the system are thinking over the premises or backgrounds behind other stakeholders' requirements, rather than at the time when stakeholders are thinking about their own requirements. Disagreements and conflicts between stakeholders are utilized as a driver to discover requirements for the system. In this paper, we propose a support tool for discovering conflicts among stakeholders, called an extended goal graph. We implemented a prototype of the tool and applied the prototype to a requirements meeting to confirm its feasibility for discovering conflicts.","PeriodicalId":192888,"journal":{"name":"2015 IEEE International Conference on Data Mining Workshop (ICDMW)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114759653","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mihir Shekhar, Veera Raghavendra Chikka, Lini T. Thomas, Sunil Mandhan, K. Karlapalem
We present an automated disease term classification model using machine learning techniques that classifies a medical term into a specific disease class. We work on five particular diseases: Cancer, AIDS, Arthritis, Diabetes, and heart-related ailments. We identify and classify medical terms such as drug names, symptoms, abbreviations, disease names, tests, etc., into their specific disease classes. The results illustrate that our model for disease term classification finds all disease term classes with an average F-score of 0.966.
{"title":"Identifying Medical Terms Related to Specific Diseases","authors":"Mihir Shekhar, Veera Raghavendra Chikka, Lini T. Thomas, Sunil Mandhan, K. Karlapalem","doi":"10.1109/ICDMW.2015.71","DOIUrl":"https://doi.org/10.1109/ICDMW.2015.71","url":null,"abstract":"We present an automated disease term classification model using machine learning techniques that classifies a medical term into a specific disease class. We work on five particular diseases: Cancer, AIDS, Arthritis, Diabetes, and heart-related ailments. We identify and classify medical terms such as drug names, symptoms, abbreviations, disease names, tests, etc., into their specific disease classes. The results illustrate that our model for disease term classification finds all disease term classes with an average F-score of 0.966.","PeriodicalId":192888,"journal":{"name":"2015 IEEE International Conference on Data Mining Workshop (ICDMW)","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127398213","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data sparseness is one of the most challenging problems in collaborative filtering (CF) based recommendation systems. Exploiting social tag information is becoming a popular way to alleviate the problem and improve performance. To this end, recent recommendation methods often take the relationships between users/items and tags into consideration; however, the correlations among tags from different item domains are always ignored. In this paper, we therefore propose a novel way to exploit the rating patterns across multiple domains by transferring the tag co-occurrence matrix information, which can be used to reveal common user patterns. With extensive experiments, we demonstrate the effectiveness of our approach for cross-domain information recommendation.
{"title":"Cross-Domain Recommendation via Tag Matrix Transfer","authors":"Zhou Fang, Sheng Gao, B. Li, Juncen Li, J. Liao","doi":"10.1109/ICDMW.2015.133","DOIUrl":"https://doi.org/10.1109/ICDMW.2015.133","url":null,"abstract":"Data sparseness is one of the most challenging problems in collaborative filtering (CF) based recommendation systems. Exploiting social tag information is becoming a popular way to alleviate the problem and improve performance. To this end, recent recommendation methods often take the relationships between users/items and tags into consideration; however, the correlations among tags from different item domains are always ignored. In this paper, we therefore propose a novel way to exploit the rating patterns across multiple domains by transferring the tag co-occurrence matrix information, which can be used to reveal common user patterns. With extensive experiments, we demonstrate the effectiveness of our approach for cross-domain information recommendation.","PeriodicalId":192888,"journal":{"name":"2015 IEEE International Conference on Data Mining Workshop (ICDMW)","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126691070","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
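The tag co-occurrence matrix named in the abstract above can be illustrated with a small counting example. The sketch below counts, for items pooled from two domains, how many items carry each pair of tags. It shows only the counting step, under the assumption that co-occurrence means "assigned to the same item"; the abstract does not describe the paper's exact construction or its transfer step, and the function name `tag_cooccurrence` and the toy items are illustrative.

```python
from collections import Counter
from itertools import combinations

def tag_cooccurrence(item_tags):
    """Count tag pairs that appear together on the same item.

    item_tags: {item_id: iterable of tags}.
    Returns a Counter keyed by alphabetically sorted (tag_a, tag_b) pairs,
    i.e. the upper triangle of a symmetric co-occurrence matrix.
    """
    counts = Counter()
    for tags in item_tags.values():
        for a, b in combinations(sorted(set(tags)), 2):
            counts[(a, b)] += 1
    return counts

# Hypothetical items from a movie domain and a book domain sharing tags.
item_tags = {
    "movie1": ["comedy", "romance"],
    "book1": ["romance", "comedy", "classic"],
    "book2": ["classic"],
}
counts = tag_cooccurrence(item_tags)
print(counts[("comedy", "romance")])  # 2
```

Because the tags `comedy` and `romance` co-occur in both domains, their count is a simple example of the kind of shared signal that could link otherwise separate domains.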