2015 IEEE International Conference on Data Mining Workshop (ICDMW): Latest Papers


Estimating Taxi Demand-Supply Level Using Taxi Trajectory Data Stream

2015 IEEE International Conference on Data Mining Workshop (ICDMW)

Pub Date : 2015-11-14 DOI: 10.1109/ICDMW.2015.250

Dongxu Shao, Wei Wu, Shili Xiang, Yu Lu

Taxis provide a flexible and indispensable service to satisfy the urban travel demand of public commuters. Understanding taxi supply and commuter demand, especially the imbalance between the supply and the demand, would directly help to improve the quality of taxi service and eventually increase a city's traffic system efficiency. In this paper, we consider the taxi demand from a region during a period of time to include two parts: satisfied demand, i.e., passengers successfully receive taxi service during this period of time, and unmet demand, i.e., passengers are still waiting for taxi service. To properly estimate the demand-supply level (short for "the level of the taxi demand vs. supply imbalance"), we propose a novel indicator that reflects how fast an available taxi is taken in any given region. Accordingly, we design and implement a taxi analytics system to provide such information in near real time. Finally, we use the passenger waiting time survey data and the taxi streaming data to validate the proposed indicator on the built taxi analytics system.
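The idea of an indicator that reflects "how fast an available taxi is taken" can be illustrated with a minimal sketch. The abstract does not define the indicator, so the event format and the function `mean_time_to_hire` below are hypothetical: the sketch simply averages, per region, the time between a taxi becoming available and being hired. Under this assumption, a shorter average suggests demand is high relative to supply.

```python
from collections import defaultdict
from statistics import mean


def mean_time_to_hire(events):
    """Average seconds between a taxi becoming available and being hired.

    events: iterable of (region, available_at_s, hired_at_s) tuples.
    This record layout is an assumption made for illustration only.
    """
    waits = defaultdict(list)
    for region, available_at, hired_at in events:
        if hired_at >= available_at:  # skip inconsistent records
            waits[region].append(hired_at - available_at)
    return {region: mean(values) for region, values in waits.items()}


# Hypothetical sample: two quick pickups downtown, one slow pickup in a suburb.
events = [
    ("downtown", 0, 60),
    ("downtown", 30, 150),
    ("suburb", 0, 900),
]
result = mean_time_to_hire(events)
```

In a streaming setting the same statistic would be maintained over a sliding time window instead of a fixed list.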


Citations: 27

Large-Scale Linear Support Vector Ordinal Regression Solver

2015 IEEE International Conference on Data Mining Workshop (ICDMW)

Pub Date : 2015-11-14 DOI: 10.1109/ICDMW.2015.257

Yong Shi, Huadong Wang, Lingfeng Niu

In multi-class classification, there is a common type of problem in which each instance is associated with an ordinal label; it arises in various settings such as text mining, visual recognition, and other information retrieval tasks. Support vector ordinal regression (SVOR) is a widely used model for ordinal regression. In some applications, such as document classification, data usually appear in a high-dimensional feature space, and linear SVOR becomes a good choice. In this work, we develop an efficient solver for training large-scale linear SVOR based on the alternating direction method of multipliers (ADMM). When compared empirically on benchmark data sets, the proposed solver enjoys advantages in both training speed and generalization performance over the SMO-based method, which validates the effectiveness and efficiency of our algorithm.


Citations: 0

Connecting Devices to Cookies via Filtering, Feature Engineering, and Boosting

2015 IEEE International Conference on Data Mining Workshop (ICDMW)

Pub Date : 2015-11-14 DOI: 10.1109/ICDMW.2015.236

M. Kim, Jiwei Liu, Xiaozhou Wang, Wei Yang

We present a supervised machine learning system capable of matching internet devices to web cookies through filtering, feature engineering, binary classification, and post-processing. The system builds a reasonably sized training and testing data set through filtering and feature engineering. We build 415 features in total. Some of these features were engineered to be O(n)-time standalone classifiers for this problem. Other features use various natural language processing (NLP) techniques. Meta features are created by ridge regression and AdaBoost. Binary classification is then performed with two different gradient boosting models (XGBoost with logarithmic loss). A post-processing pipeline connects devices and cookies in a way that maximizes the F_0.5 score. Our machine learning system obtained a private F_0.5 score of 0.849562, for a final rank of 12th out of 340 on the ICDM 2015: Drawbridge Cross-Device Connections challenge.
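For readers unfamiliar with the F_0.5 score, the following sketch uses the standard F-beta definition, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), with beta = 0.5 so that precision counts more than recall. The function name `f_beta` is chosen here for illustration; the challenge's exact scoring code is not given in the abstract.

```python
def f_beta(precision, recall, beta=0.5):
    """Standard F-beta score; beta < 1 weights precision more than recall."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid division by zero when nothing is correct
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# With beta = 0.5, a high-precision system scores better than a
# high-recall system with the mirrored precision/recall values.
high_precision = f_beta(precision=1.0, recall=0.5)
high_recall = f_beta(precision=0.5, recall=1.0)
```

This asymmetry is why a post-processing step tuned for F_0.5 would tend to prefer fewer, more certain device-cookie links.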


Citations: 13

LeSiNN: Detecting Anomalies by Identifying Least Similar Nearest Neighbours

2015 IEEE International Conference on Data Mining Workshop (ICDMW)

Pub Date : 2015-11-14 DOI: 10.1109/ICDMW.2015.62

Guansong Pang, K. Ting, D. Albrecht

We introduce the concept of Least Similar Nearest Neighbours (LeSiNN) and use LeSiNN to detect anomalies directly. Although there is an existing method which is a special case of LeSiNN, this paper is, as far as we know, the first to clearly articulate the underlying concept. LeSiNN is the first ensemble method which works well with models trained using samples of one instance. LeSiNN has linear time complexity with respect to data size and the number of dimensions, and it is one of the few anomaly detectors which can be applied directly to both numeric and categorical data sets. Our extensive empirical evaluation shows that LeSiNN is either competitive with or better than six state-of-the-art anomaly detectors in terms of detection accuracy and runtime.
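The abstract does not spell out the scoring rule, so the following is only a minimal sketch of one plausible reading: score a point by its distance to the nearest neighbour within each of several small random subsamples, and average those distances over the ensemble. The function name `nn_distance_score` and the parameters `n_subsamples` and `subsample_size` are assumptions for illustration, not the authors' implementation.

```python
import math
import random


def nn_distance_score(x, data, n_subsamples=50, subsample_size=8, seed=0):
    """Average nearest-neighbour distance of x over random subsamples of data.

    Higher scores mean x is less similar to its nearest neighbours,
    i.e. more likely to be an anomaly under this sketch's assumption.
    """
    rng = random.Random(seed)
    size = min(subsample_size, len(data))
    total = 0.0
    for _ in range(n_subsamples):
        sample = rng.sample(data, size)  # subsample without replacement
        total += min(math.dist(x, s) for s in sample)
    return total / n_subsamples


# Hypothetical data: a tight 5x5 grid of normal points near the origin.
data = [(i * 0.1, j * 0.1) for i in range(5) for j in range(5)]
inlier_score = nn_distance_score((0.2, 0.2), data)
outlier_score = nn_distance_score((10.0, 10.0), data)
```

Scoring one point costs `n_subsamples * subsample_size` distance computations regardless of the data set size, so scoring every point grows linearly with the number of points and dimensions.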


Citations: 42

Trajectory-Based Task Allocation for Reliable Mobile Crowd Sensing Systems

2015 IEEE International Conference on Data Mining Workshop (ICDMW)

Pub Date : 2015-11-14 DOI: 10.1109/ICDMW.2015.90

Petar Mrazovic, M. Matskin, Nima Dokoohaki

Mobile crowd sensing (MCS) is a promising people-centric sensing paradigm which allows ordinary citizens to contribute sensing data using mobile communication devices. In this paper we study the correlation between users' mobility and their role as contributors in MCS applications. We propose a new trajectory-based approach for task allocation in MCS environments and model participants' spatio-temporal competences by analyzing their mobile traces. By allocating MCS tasks only to participants who are familiar with the target location, we significantly increase the reliability of contributed data and reduce the total communication cost. We introduce a novel metric to estimate participants' competence to conduct MCS tasks and propose a fair ranking approach that allows newcomers to compete with experienced senior contributors. Additionally, we group similar expert contributors and thus open up new possibilities for physical collaboration between them. We evaluate our work using the GeoLife trajectory dataset, and the experimental results show the advantages of our approach.


Citations: 4

A Multiple Classifier System for Classifying Life Events on Social Media

2015 IEEE International Conference on Data Mining Workshop (ICDMW)

Pub Date : 2015-11-14 DOI: 10.1109/ICDMW.2015.182

P. Cavalin, L. G. Moyano, Pedro P. Miranda

In this work we present a Conversation Classifier based on Multiple Classifiers to detect Life Events on Social Media. On the one hand, conversations can provide more context and help disambiguate life event detection, compared with single posts. On the other hand, the increase in the number of messages and the way they interact with each other within the conversation cannot be trivially modeled by a classifier. To tackle this problem, we focus on creating a set of classifiers from different feature sets and combining their classification outputs to improve accuracy. The experiments show that multiple classifiers are promising for this problem, yielding an increase of about 45% in the F-Score.
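Combining the outputs of several classifiers can be done in many ways, and the abstract does not say which rule is used. As one common option, the sketch below averages per-label probabilities across classifiers (often called soft voting) and picks the label with the highest average. The function `combine_by_average`, the labels, and the probability values are all hypothetical.

```python
def combine_by_average(prob_outputs):
    """Average label probabilities over classifiers and pick the best label.

    prob_outputs: non-empty list of dicts mapping label -> probability,
    one dict per classifier (each trained on a different feature set).
    """
    labels = set().union(*prob_outputs)
    avg = {
        label: sum(p.get(label, 0.0) for p in prob_outputs) / len(prob_outputs)
        for label in labels
    }
    best = max(avg, key=avg.get)
    return best, avg


# Hypothetical outputs of three classifiers for one conversation.
outputs = [
    {"life_event": 0.6, "other": 0.4},
    {"life_event": 0.3, "other": 0.7},
    {"life_event": 0.8, "other": 0.2},
]
label, avg = combine_by_average(outputs)
```

Here one classifier disagrees, but the averaged evidence still favours the life-event label.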


Citations: 11

Near Real-Time Service Monitoring Using High-Dimensional Time Series

2015 IEEE International Conference on Data Mining Workshop (ICDMW)

Pub Date : 2015-11-14 DOI: 10.1109/ICDMW.2015.254

Shwetabh Khanduja, Vinod Nair, S. Sundararajan, Ameya Raul, Ajesh Babu Shaj, S. Keerthi

We demonstrate a near real-time service monitoring system for detecting and diagnosing issues from high-dimensional time series data. For detection, we have implemented a learning algorithm that constructs a hierarchy of detectors from data. It is scalable, does not require labelled examples of issues for learning, runs in near real-time, and identifies a subset of counter time series as being relevant for a detected issue. For diagnosis, we provide efficient algorithms as post-detection diagnosis aids to find further relevant counter time series at issue times, a SQL-like query language for writing flexible queries that apply these algorithms on the time series data, and a graphical user interface for visualizing the detection and diagnosis results. Our solution has been deployed in production as an end-to-end system for monitoring Microsoft's internal distributed data storage and computing platform consisting of tens of thousands of machines and currently analyses about 12000 counter time series.


Citations: 3

Extended Goal Graph: A Support Tool for Discovering Conflicts among Stakeholders and Promoting Requirements Elicitation with Goal Orientation

2015 IEEE International Conference on Data Mining Workshop (ICDMW)

Pub Date : 2015-11-14 DOI: 10.1109/ICDMW.2015.52

N. Kushiro, Takuro Shimizu

Requirements for a system are often discovered during the negotiation process, at the time when stakeholders of the system are thinking over the premises or backgrounds behind other stakeholders' requirements, rather than when they are thinking about their own requirements. Disagreements and conflicts between stakeholders are thus utilized as a driver to discover requirements for the system. In this paper, we propose a support tool for discovering conflicts among stakeholders, called an extended goal graph. We implemented a prototype of the tool and applied it to a requirements meeting to confirm its feasibility for discovering conflicts.


Citations: 3

Identifying Medical Terms Related to Specific Diseases

2015 IEEE International Conference on Data Mining Workshop (ICDMW)

Pub Date : 2015-11-14 DOI: 10.1109/ICDMW.2015.71

Mihir Shekhar, Veera Raghavendra Chikka, Lini T. Thomas, Sunil Mandhan, K. Karlapalem

We present an automated disease term classification model using machine learning techniques that classifies a medical term to a specific disease class. We work on five particular diseases: cancer, AIDS, arthritis, diabetes, and heart-related ailments. We identify and classify medical terms such as drug names, symptoms, abbreviations, disease names, and tests into their specific disease classes. The results illustrate that our model for disease term classification finds all disease term classes with an average F-score of 0.966.


Citations: 3

Cross-Domain Recommendation via Tag Matrix Transfer

2015 IEEE International Conference on Data Mining Workshop (ICDMW)

Pub Date : 2015-11-14 DOI: 10.1109/ICDMW.2015.133

Zhou Fang, Sheng Gao, B. Li, Juncen Li, J. Liao

Data sparseness is one of the most challenging problems in collaborative filtering (CF) based recommendation systems. Exploiting social tag information is becoming a popular way to alleviate the problem and improve performance. To this end, recent recommendation methods often take the relationships between users/items and tags into consideration; however, the correlations among tags from different item domains are always ignored. In this paper we therefore propose a novel way to exploit the rating patterns across multiple domains by transferring the tag co-occurrence matrix information, which can be used to reveal common user patterns. With extensive experiments we demonstrate the effectiveness of our approach for cross-domain information recommendation.
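To make the notion of a tag co-occurrence matrix concrete, the sketch below counts how often two tags are attached to the same item. It covers only this building block; how the paper transfers the matrix across domains is not described in the abstract. The function `tag_cooccurrence` and the sample items and tags are hypothetical.

```python
from collections import Counter
from itertools import combinations


def tag_cooccurrence(item_tags):
    """Count, for each unordered tag pair, the items carrying both tags.

    item_tags: dict mapping item id -> iterable of tags.
    Returns a Counter keyed by (tag_a, tag_b) with tag_a < tag_b.
    """
    counts = Counter()
    for tags in item_tags.values():
        # de-duplicate and sort so each pair has one canonical key
        for a, b in combinations(sorted(set(tags)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical items from a single domain (e.g. movies).
movies = {
    "m1": ["scifi", "classic"],
    "m2": ["scifi", "classic", "space"],
    "m3": ["space"],
}
counts = tag_cooccurrence(movies)
```

Because tags such as "scifi" or "classic" can appear in several domains (movies, books), co-occurrence statistics of this kind offer a shared vocabulary between domains even when users and items do not overlap.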


Citations: 20
