Source-Aware Partitioning for Robust Cross-Validation

2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA). Pub Date: 2015-12-01. DOI: 10.1109/ICMLA.2015.216

Ozsel Kilinc, Ismail Uysal


Citations: 5

Abstract

One of the most critical components of engineering a machine learning algorithm for a live application is robust performance assessment prior to its implementation. Cross-validation is used to forecast an algorithm's classification or prediction accuracy on new input data, given a finite dataset for training and testing. The two best-known cross-validation techniques, random subsampling (RSS) and K-fold, generalize the assessment results of machine learning algorithms in a non-exhaustive, random manner. In this work, we first show that for an inertia-based activity recognition problem, where data is collected from different users of a wrist-worn wireless accelerometer, random partitioning of the data yields statistically similar average accuracies for a standard feed-forward neural network classifier, regardless of the cross-validation technique. We then propose a novel source-aware partitioning technique in which samples from specific users are left out of the training/validation sets entirely, in rotation. The average error under the proposed cross-validation method is significantly higher, with lower standard deviation, which is a major indicator of cross-validation robustness. The approximately 30% increase in average error rate implies that source-aware cross-validation could be a better indicator of live algorithm performance, where test data statistics would differ significantly from those of the training data because of the source (or user)-sensitive nature of the process data.
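The contrast between random K-fold and source-aware partitioning can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy user labels, and the choice to hold out exactly one user per fold are assumptions made for illustration, since the abstract only states that samples from specific users are left out in rotation.

```python
import random
from collections import defaultdict


def random_k_folds(n_samples, k, seed=0):
    """Random K-fold: shuffle sample indices, then deal them into k test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield sorted(train_idx), sorted(test_idx)


def source_aware_folds(user_ids):
    """Source-aware split (assumed form): each fold holds out all samples of one user."""
    by_user = defaultdict(list)
    for idx, user in enumerate(user_ids):
        by_user[user].append(idx)
    for held_out in sorted(by_user):
        test_idx = by_user[held_out]
        train_idx = [i for u, idxs in by_user.items() if u != held_out for i in idxs]
        yield held_out, sorted(train_idx), test_idx


# Hypothetical toy data: 3 users with 4 samples each (not from the paper).
user_ids = ["u1"] * 4 + ["u2"] * 4 + ["u3"] * 4

for held_out, train_idx, test_idx in source_aware_folds(user_ids):
    train_users = {user_ids[i] for i in train_idx}
    # The held-out user never contributes a training sample.
    assert held_out not in train_users
    print(held_out, "train:", len(train_idx), "test:", len(test_idx))
```

With a random split, samples from the same user usually land on both sides of the train/test boundary, so the classifier is tested on sources it has already seen. The source-aware split removes that overlap, which is consistent with the higher average error the abstract reports.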


