S. Ruggles, T. Kugler, Catherine A. Fitch, D. V. Riper
Terra Populus, part of the National Science Foundation's DataNet initiative, is developing organizational and technical infrastructure to integrate, preserve, and disseminate data describing changes in the human population and environment over time. A large number of high-quality environmental and population datasets are available, but they are widely dispersed, have incompatible or inadequate metadata, and have incompatible geographic identifiers. The new Terra Populus infrastructure enables researchers to identify and merge data from heterogeneous sources to study the relationships between human behavior and the natural world.
{"title":"Terra Populus: Integrated Data on Population and Environment","authors":"S. Ruggles, T. Kugler, Catherine A. Fitch, D. V. Riper","doi":"10.1109/ICDMW.2015.204","DOIUrl":"https://doi.org/10.1109/ICDMW.2015.204","url":null,"abstract":"Terra Populus, part of National Science Foundation's DataNet initiative, is developing organizational and technical infrastructure to integrate, preserve, and disseminate data describing changes in the human population and environment over time. A large number of high-quality environmental and population datasets are available, but they are widely dispersed, have incompatible or inadequate metadata, and have incompatible geographic identifiers. The new Terra Populus infrastructure enables researchers to identify and merge data from heterogeneous sources to study the relationships between human behavior and the natural world.","PeriodicalId":192888,"journal":{"name":"2015 IEEE International Conference on Data Mining Workshop (ICDMW)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132700032","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In multi-class classification, a common type of problem associates each instance with an ordinal label; it arises in settings such as text mining, visual recognition, and other information retrieval tasks. Support vector ordinal regression (SVOR) is a widely used model for ordinal regression. In some applications, such as document classification, data usually lies in a high-dimensional feature space, and linear SVOR becomes a good choice. In this work, we develop an efficient solver for training large-scale linear SVOR based on the alternating direction method of multipliers (ADMM). When compared empirically on benchmark data sets, the proposed solver shows advantages in both training speed and generalization performance over the SMO-based method, which validates the effectiveness and efficiency of our algorithm.
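The abstract does not spell out the model, so the following is only a minimal sketch of how a linear ordinal regression model of this kind is commonly applied at prediction time. It assumes a weight vector `w` and ordered thresholds that split the real line into consecutive rank intervals. The function name, the weights, and the thresholds are hypothetical values chosen for illustration, and the ADMM training procedure itself is not shown.

```python
from bisect import bisect_right


def predict_rank(w, thresholds, x):
    """Return a rank in 1..len(thresholds)+1 for feature vector x.

    Assumes `thresholds` is sorted in ascending order.
    """
    # Linear score: inner product of weights and features.
    score = sum(wi * xi for wi, xi in zip(w, x))
    # Rank = 1 + number of thresholds that are <= the score.
    return 1 + bisect_right(thresholds, score)


# Hypothetical model with four ranks (three thresholds).
w = [0.5, -1.0]
thresholds = [-1.0, 0.0, 2.0]

print(predict_rank(w, thresholds, [0.0, 3.0]))  # score = -3.0
print(predict_rank(w, thresholds, [2.0, 0.5]))  # score = 0.5
print(predict_rank(w, thresholds, [6.0, 0.0]))  # score = 3.0
```

Because prediction needs only one inner product and a binary search over the thresholds, the cost per instance stays low even when the feature space is high-dimensional.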
{"title":"Large-Scale Linear Support Vector Ordinal Regression Solver","authors":"Yong Shi, Huadong Wang, Lingfeng Niu","doi":"10.1109/ICDMW.2015.257","DOIUrl":"https://doi.org/10.1109/ICDMW.2015.257","url":null,"abstract":"In multiple classification, there is a type of commonproblems where each instance is associated with an ordinal label, which arises in various settings such as text mining, visual recognition and other information retrieval tasks. The support vectorordinal regression (SVOR) is a good model widely used for ordinalregression. In some applications such as document classification, data usually appears in a high dimensional feature space andlinear SVOR becomes a good choice. In this work, we developan efficient solver for training large-scale linear SVOR basedon alternating direction method of multipliers(ADMM). Whencompared empirically on benchmark data sets, the proposedsolver enjoys advantages in terms of both training speed andgeneralization performance over the method based on SMO, which invalidate the effectiveness and efficiency of our algorithm.","PeriodicalId":192888,"journal":{"name":"2015 IEEE International Conference on Data Mining Workshop (ICDMW)","volume":"41 8","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113938582","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present a supervised machine learning system capable of matching internet devices to web cookies through filtering, feature engineering, binary classification, and post-processing. The system builds a reasonably sized training and testing data set through filtering and feature engineering. We build 415 features in total. Some of these features were engineered to be O(n)-time, standalone classifiers for this problem. Other features use various natural language processing (NLP) techniques. Meta-features are created by ridge regression and AdaBoost. Binary classification is then performed by two different gradient boosting models (XGBoost with logarithmic loss). A post-processing pipeline connects devices and cookies in a way that maximizes the F_0.5 score. Our machine learning system obtained a private F_0.5 score of 0.849562 for a final rank of 12th/340 on the ICDM 2015: Drawbridge Cross-Device Connections challenge.
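For readers unfamiliar with the metric, the F_0.5 score mentioned above is the F-beta measure with beta = 0.5, which weights precision more heavily than recall. The sketch below computes it from match counts; the function name and the counts are hypothetical and are not taken from the challenge data.

```python
def f_beta(tp, fp, fn, beta=0.5):
    """F-beta score from true positives, false positives, false negatives.

    Equivalent to (1 + beta^2) * P * R / (beta^2 * P + R),
    where P is precision and R is recall.
    """
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    # Return 0.0 when there are no positives at all.
    return (1 + b2) * tp / denom if denom else 0.0


# Hypothetical counts: precision = 0.8, recall = 0.5.
print(f_beta(tp=8, fp=2, fn=8))            # F_0.5
print(f_beta(tp=8, fp=2, fn=8, beta=1.0))  # F_1, for comparison
```

With these counts F_0.5 is higher than F_1, since precision exceeds recall and beta = 0.5 favors precision.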
{"title":"Connecting Devices to Cookies via Filtering, Feature Engineering, and Boosting","authors":"M. Kim, Jiwei Liu, Xiaozhou Wang, Wei Yang","doi":"10.1109/ICDMW.2015.236","DOIUrl":"https://doi.org/10.1109/ICDMW.2015.236","url":null,"abstract":"We present a supervised machine learning system capable of matching internet devices to web cookies through filtering, feature engineering, binary classification, and post processing. The system builds a reasonably sized training and testing data set through filtering and feature engineering. We build 415 features in total. Some of these features were engineered to be O(n) time, stand alone classifiers for this problem. Other features use various natural language processing (NLP) techniques. Meta features are created by ridge regression and Adaboost. Then binary classification through two different gradient boosting (XGBoost with logarithmic loss) models is performed. A post processing pipeline connects devices and cookies in a way that maximizes F_0.5 score. Our machine learning system obtained a private F_0.5 score of 0.849562 for a final rank of 12th/340 on the ICDM 2015: Drawbridge Cross-Device Connections challenge.","PeriodicalId":192888,"journal":{"name":"2015 IEEE International Conference on Data Mining Workshop (ICDMW)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127899133","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We introduce the concept of Least Similar Nearest Neighbours (LeSiNN) and use LeSiNN to detect anomalies directly. Although there is an existing method which is a special case of LeSiNN, this paper is the first to clearly articulate the underlying concept, as far as we know. LeSiNN is the first ensemble method which works well with models trained using samples of one instance. LeSiNN has linear time complexity with respect to data size and the number of dimensions, and it is one of the few anomaly detectors which can apply directly to both numeric and categorical data sets. Our extensive empirical evaluation shows that LeSiNN is either competitive with or better than six state-of-the-art anomaly detectors in terms of detection accuracy and runtime.
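The abstract gives no algorithmic detail, so the following is only an illustrative sketch of a subsample-based nearest-neighbour anomaly score. It assumes that the score of a point is its distance to the nearest neighbour in a small random subsample, averaged over an ensemble of subsamples. The function names, parameter names, default values, and the use of Euclidean distance on numeric data are all assumptions made for this example.

```python
import math
import random


def nn_distance(x, sample):
    """Distance from x to its nearest neighbour in `sample`."""
    return min(math.dist(x, s) for s in sample)


def anomaly_scores(data, n_models=50, sample_size=4, seed=0):
    """Average nearest-neighbour distance over random subsamples.

    Cost is O(len(data) * n_models * sample_size * dims), which is
    linear in the data size for fixed ensemble parameters.
    """
    rng = random.Random(seed)
    samples = [rng.sample(data, sample_size) for _ in range(n_models)]
    return [
        sum(nn_distance(x, s) for s in samples) / n_models
        for x in data
    ]


# Hypothetical data: a 5x5 grid in the unit square plus one far point.
data = [(i / 4, j / 4) for i in range(5) for j in range(5)]
data.append((10.0, 10.0))

scores = anomaly_scores(data)
most_anomalous = max(range(len(data)), key=scores.__getitem__)
print(data[most_anomalous])
```

Points whose nearest neighbours in most subsamples are far away receive high scores, which is why the isolated point stands out from the grid.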
{"title":"LeSiNN: Detecting Anomalies by Identifying Least Similar Nearest Neighbours","authors":"Guansong Pang, K. Ting, D. Albrecht","doi":"10.1109/ICDMW.2015.62","DOIUrl":"https://doi.org/10.1109/ICDMW.2015.62","url":null,"abstract":"We introduce the concept of Least Similar Nearest Neighbours (LeSiNN) and use LeSiNN to detect anomalies directly. Although there is an existing method which is a special case of LeSiNN, this paper is the first to clearly articulate the underlying concept, as far as we know. LeSiNN is the first ensemble method which works well with models trained using samples of one instance. LeSiNN has linear time complexity with respect to data size and the number of dimensions, and it is one of the few anomaly detectors which can apply directly to both numeric and categorical data sets. Our extensive empirical evaluation shows that LeSiNN is either competitive to or better than six state-of-the-art anomaly detectors in terms of detection accuracy and runtime.","PeriodicalId":192888,"journal":{"name":"2015 IEEE International Conference on Data Mining Workshop (ICDMW)","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121199882","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mobile crowd sensing (MCS) is a promising people-centric sensing paradigm which allows ordinary citizens to contribute sensing data using mobile communication devices. In this paper we study the correlation between users' mobility and their role as contributors in MCS applications. We propose a new trajectory-based approach for task allocation in MCS environments and model participants' spatio-temporal competences by analyzing their mobile traces. By allocating MCS tasks only to participants who are familiar with the target location, we significantly increase the reliability of contributed data and reduce total communication cost. We introduce a novel metric to estimate participants' competence to conduct MCS tasks and propose a fair ranking approach that allows newcomers to compete with experienced senior contributors. Additionally, we group similar expert contributors and thus open up new possibilities for physical collaboration between them. We evaluate our work using the GeoLife trajectory dataset, and the experimental results show the advantages of our approach.
{"title":"Trajectory-Based Task Allocation for Reliable Mobile Crowd Sensing Systems","authors":"Petar Mrazovic, M. Matskin, Nima Dokoohaki","doi":"10.1109/ICDMW.2015.90","DOIUrl":"https://doi.org/10.1109/ICDMW.2015.90","url":null,"abstract":"Mobile crowd sensing (MCS) is as a promising people-centric sensing paradigm which allows ordinary citizens to contribute sensing data using mobile communication devices. In this paper we study correlation between users' mobility and their role as contributors in MCS applications. We propose a new trajectory-based approach for task allocation in MCS environments and model participants' spatio-temporal competences by analyzing their mobile traces. By allocating MCS tasks only to participant who are familiar with the target location we significantly increase the reliability of contributed data and reduce total communication cost. We introduce novel metric to estimate participants' competence to conduct MCS tasks and propose fair ranking approach allowing newcomers to compete with experienced senior contributors. Additionally, we group similar expert contributors and thus open up new possibilities for physical collaboration between them. We evaluate our work using GeoLife trajectory dataset and the experimental results show the advantages of our approach.","PeriodicalId":192888,"journal":{"name":"2015 IEEE International Conference on Data Mining Workshop (ICDMW)","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128586440","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this work we present a Conversation Classifier based on Multiple Classifiers to detect Life Events on Social Media. On the one hand, conversations can provide more context and help disambiguate life event detection, compared with single posts. On the other hand, the increase in the number of messages and the way they interact with each other within the conversation cannot be trivially modeled by a classifier. To tackle this problem, we focus on creating a set of classifiers from different feature sets and combining their classification outputs to improve accuracy. The experiments show that multiple classifiers are promising for this problem, yielding an increase of about 45% in the F-score.
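The abstract does not state which combination rule is used. As one common possibility, the sketch below averages the class probabilities produced by several classifiers and picks the label with the highest mean. The function name, the labels, and the probabilities are hypothetical.

```python
def combine_by_average(prob_outputs):
    """Average per-class probabilities from several classifiers.

    prob_outputs: non-empty list of dicts mapping label -> probability,
    one dict per classifier. Returns (best_label, averaged_probs).
    """
    labels = set().union(*prob_outputs)
    n = len(prob_outputs)
    avg = {
        label: sum(p.get(label, 0.0) for p in prob_outputs) / n
        for label in labels
    }
    return max(avg, key=avg.get), avg


# Hypothetical outputs of three classifiers built on different feature sets.
outputs = [
    {"life_event": 0.6, "other": 0.4},
    {"life_event": 0.3, "other": 0.7},
    {"life_event": 0.8, "other": 0.2},
]

label, avg = combine_by_average(outputs)
print(label, avg)
```

Here one classifier disagrees with the other two, and averaging lets the more confident majority decide the final label.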
{"title":"A Multiple Classifier System for Classifying Life Events on Social Media","authors":"P. Cavalin, L. G. Moyano, Pedro P. Miranda","doi":"10.1109/ICDMW.2015.182","DOIUrl":"https://doi.org/10.1109/ICDMW.2015.182","url":null,"abstract":"In this work we present a Conversation Classifierbased on Multiple Classifiers, to detect Life Events on SocialMedia. In one hand, conversations can provide more contextand help disambiguate life event detection, compared with single posts. On the other hand, the increase in number of messages and the way they interact with each other within the conversation cannot be trivially modeled by a classifier. To tackle this problem, we focus on creating a set of classifiers from different feature sets, and combining their classification outputs to improve accuracy. The experiments show that multiple classifiers are promising for this problem, being able to present an increase of about 45% in the F-Score.","PeriodicalId":192888,"journal":{"name":"2015 IEEE International Conference on Data Mining Workshop (ICDMW)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117087498","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shwetabh Khanduja, Vinod Nair, S. Sundararajan, Ameya Raul, Ajesh Babu Shaj, S. Keerthi
We demonstrate a near real-time service monitoring system for detecting and diagnosing issues from high-dimensional time series data. For detection, we have implemented a learning algorithm that constructs a hierarchy of detectors from data. It is scalable, does not require labelled examples of issues for learning, runs in near real-time, and identifies a subset of counter time series as being relevant for a detected issue. For diagnosis, we provide efficient algorithms as post-detection diagnosis aids to find further relevant counter time series at issue times, a SQL-like query language for writing flexible queries that apply these algorithms on the time series data, and a graphical user interface for visualizing the detection and diagnosis results. Our solution has been deployed in production as an end-to-end system for monitoring Microsoft's internal distributed data storage and computing platform consisting of tens of thousands of machines and currently analyses about 12000 counter time series.
{"title":"Near Real-Time Service Monitoring Using High-Dimensional Time Series","authors":"Shwetabh Khanduja, Vinod Nair, S. Sundararajan, Ameya Raul, Ajesh Babu Shaj, S. Keerthi","doi":"10.1109/ICDMW.2015.254","DOIUrl":"https://doi.org/10.1109/ICDMW.2015.254","url":null,"abstract":"We demonstrate a near real-time service monitoring system for detecting and diagnosing issues from high-dimensional time series data. For detection, we have implemented a learning algorithm that constructs a hierarchy of detectors from data. It is scalable, does not require labelled examples of issues for learning, runs in near real-time, and identifles a subset of counter time series as being relevant for a detected issue. For diagnosis, we provide efflcient algorithms as post-detection diagnosis aids to flnd further relevant counter time series at issue times, a SQL-like query language for writing flexible queries that apply these algorithms on the time series data, and a graphical user interface for visualizing the detection and diagnosis results. Our solution has been deployed in production as an end-to-end system for monitoring Microsoft's internal distributed data storage and computing platform consisting of tens of thousands of machines and currently analyses about 12000 counter time series.","PeriodicalId":192888,"journal":{"name":"2015 IEEE International Conference on Data Mining Workshop (ICDMW)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115357151","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Requirements for a system are often discovered during the negotiation process, when stakeholders are thinking over the premises or backgrounds behind other stakeholders' requirements, rather than when they are thinking about their own requirements. Disagreements and conflicts between stakeholders can be used as a driver to discover requirements for the system. In this paper, we propose a support tool for discovering conflicts among stakeholders, called an extended goal graph. We implemented a prototype of the tool and applied it to a requirements meeting to confirm its feasibility for discovering conflicts.
{"title":"Extended Goal Graph: A Support Tool for Discovering Conflicts among Stakeholders and Promoting Requirements Elicitation with Goal Orientation","authors":"N. Kushiro, Takuro Shimizu","doi":"10.1109/ICDMW.2015.52","DOIUrl":"https://doi.org/10.1109/ICDMW.2015.52","url":null,"abstract":"Requirements for a system are often discovered during negotiation process, at the time when stakeholders of the system are thinking over their premises or backgrounds behind other stakeholders' requirements, rather than at the time when stakeholders thinking about their own requirements. Disagreements and conflicts between stakeholders are utilized as a driver to discover requirements for the system. In this paper, we propose a support tool for discovering conflicts among stakeholders, called an extended goal graph. We implemented a prototype of the tool and applied the prototype to a requirements meeting to confirm feasibility for discovering conflicts.","PeriodicalId":192888,"journal":{"name":"2015 IEEE International Conference on Data Mining Workshop (ICDMW)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114759653","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mihir Shekhar, Veera Raghavendra Chikka, Lini T. Thomas, Sunil Mandhan, K. Karlapalem
We present an automated disease term classification model using machine learning techniques that classifies a medical term to a specific disease class. We work on five particular diseases: Cancer, AIDS, Arthritis, Diabetes, and heart-related ailments. We identify and classify medical terms like drug names, symptoms, abbreviations, disease names, tests, etc., into their specific disease classes. The results illustrate that our model for disease term classification finds all disease term classes with an average F-score of 0.966.
{"title":"Identifying Medical Terms Related to Specific Diseases","authors":"Mihir Shekhar, Veera Raghavendra Chikka, Lini T. Thomas, Sunil Mandhan, K. Karlapalem","doi":"10.1109/ICDMW.2015.71","DOIUrl":"https://doi.org/10.1109/ICDMW.2015.71","url":null,"abstract":"We present an automated disease term classification model using machine learning techniques that classifies a medical term to a specific disease class. We work on five particular diseases: Cancer, AIDS, Arthritis, Diabetes and heart related ailments. We identify and classify medical terms like drug names, symptoms, abbreviations, disease names, tests, etc., into their specific diseases classes. The results illustrate that our model for disease term classification finds all disease term classes with an average F-score of 0.966.","PeriodicalId":192888,"journal":{"name":"2015 IEEE International Conference on Data Mining Workshop (ICDMW)","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127398213","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data sparseness is one of the most challenging problems in collaborative filtering (CF)-based recommendation systems. Exploiting social tag information is becoming a popular way to alleviate the problem and improve performance. To this end, recent recommendation methods often take the relationships between users/items and tags into consideration; however, the correlations among tags from different item domains are always ignored. To address this, in this paper we propose a novel way to exploit the rating patterns across multiple domains by transferring the tag co-occurrence matrix information, which can be used to reveal common user patterns. With extensive experiments we demonstrate the effectiveness of our approach for cross-domain recommendation.
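The abstract does not describe how the tag co-occurrence matrix is built or transferred. As a rough illustration of the underlying data structure only, the sketch below counts how often two tags are assigned to the same item and sums the counts over two domains that share a tag vocabulary. The function name, the item identifiers, and the tags are hypothetical.

```python
from collections import Counter
from itertools import combinations


def tag_cooccurrence(item_tags):
    """Count, for each unordered tag pair, the items carrying both tags.

    item_tags: dict mapping item id -> iterable of tags.
    Keys of the result are (tag_a, tag_b) with tag_a < tag_b.
    """
    counts = Counter()
    for tags in item_tags.values():
        for a, b in combinations(sorted(set(tags)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical tag assignments in two item domains.
books = {"b1": ["fantasy", "classic"], "b2": ["fantasy", "classic", "epic"]}
movies = {"m1": ["fantasy", "epic"]}

# Adding the Counters pools co-occurrence evidence across domains.
combined = tag_cooccurrence(books) + tag_cooccurrence(movies)
print(combined)
```

Storing the counts sparsely, keyed by tag pair, keeps the structure small when most tag pairs never occur together.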
{"title":"Cross-Domain Recommendation via Tag Matrix Transfer","authors":"Zhou Fang, Sheng Gao, B. Li, Juncen Li, J. Liao","doi":"10.1109/ICDMW.2015.133","DOIUrl":"https://doi.org/10.1109/ICDMW.2015.133","url":null,"abstract":"Data sparseness is one of the most challenging problems in collaborative filtering(CF) based recommendation systems. Exploiting social tag information is becoming a popular way to alleviate the problem and improve the performance. To this end, in recent recommendation methods the relationships between users/items and tags are often taken into consideration, however, the correlations among tags from different itemdomains are always ignored. For that, in this paper we propose a novel way to exploit the rating patterns across multiple domains by transferring the tag co-occurrence matrix information, which could be used for revealing common user pattern. With extensive experiments we demonstrate the effectiveness of our approach for the cross-domain information recommendation.","PeriodicalId":192888,"journal":{"name":"2015 IEEE International Conference on Data Mining Workshop (ICDMW)","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126691070","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}