2008 IEEE International Conference on Data Mining Workshops: Latest Papers

Character String Analysis and Customer Path in Stream Data
Pub Date : 2008-12-15 DOI: 10.1109/ICDMW.2008.41
K. Yada
The purpose of this study is to propose a knowledge-discovery system that extracts helpful information from character strings representing shopper visits to product sections associated with positive and negative purchasing events. It does so by applying character string parsing technologies to stream data describing customer purchasing behavior inside a store. Using data that traced customers' movements, we focus on the number of times customers stop by particular product sections, and by representing those visits as character strings, we propose a way to handle large stream data efficiently. In our experiment, we extract store-section visiting patterns that characterize customers who purchase a relatively large volume of items, and show the usefulness of these visiting patterns. In addition, we examine index functions, calculation time, and prediction accuracy, and clarify technological issues warranting further research. The study demonstrates the feasibility of employing stream data in the marketing field and the usefulness of employing character parsing techniques.
Citations: 0
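The string representation described in the abstract of "Character String Analysis and Customer Path in Stream Data" can be illustrated with a minimal sketch. It encodes each shopper's sequence of section visits as a character string and counts in how many paths each short substring occurs, separately for high-volume shoppers and for others. The section codes, the sample paths, and the function names are hypothetical and are not taken from the paper, which may use a different encoding and pattern index.

```python
from collections import Counter

# Hypothetical mapping from store sections to single characters.
SECTION_CODES = {"vegetables": "V", "meat": "M", "fish": "F", "dairy": "D"}


def encode_path(sections):
    """Turn a list of visited section names into a character string."""
    return "".join(SECTION_CODES[s] for s in sections)


def substring_counts(paths, length):
    """Count in how many paths each substring of the given length occurs."""
    counts = Counter()
    for path in paths:
        # Use a set so that a pattern is counted once per shopper.
        seen = {path[i:i + length] for i in range(len(path) - length + 1)}
        counts.update(seen)
    return counts


# Made-up example paths, grouped by whether the shopper bought many items.
high_volume = [encode_path(p) for p in (
    ["vegetables", "meat", "vegetables", "dairy"],
    ["vegetables", "meat", "fish"],
)]
others = [encode_path(p) for p in (
    ["dairy", "fish"],
    ["meat", "dairy"],
)]

high_counts = substring_counts(high_volume, 2)
other_counts = substring_counts(others, 2)
```

Comparing `high_counts` with `other_counts` shows which visit patterns are more frequent among high-volume shoppers; in this toy data the pattern `"VM"` appears in both high-volume paths and in none of the others.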
Scalable Sparse Bayesian Network Learning for Spatial Applications
Pub Date : 2008-12-15 DOI: 10.1109/ICDMW.2008.124
T. Liebig, Christine Kopp, M. May
Traffic routes through a street network contain patterns and are not random walks. Such patterns exist, for instance, along streets or between neighbouring street segments. Extracting these patterns is challenging because of the enormous size of city street networks, the large amount of training data required, and the unknown distribution of that data. We apply Bayesian Networks to model the correlations between the locations in space-time trajectories and address the following tasks. We introduce and examine a Bayesian Network Learning algorithm that enables us to handle the complexity and performance requirements of the spatial context. Furthermore, we apply our method to German cities, evaluate the accuracy and analyse the runtime behaviour for different parameter settings.
Citations: 13
Semi-supervised Collaborative Clustering with Partial Background Knowledge
Pub Date : 2008-12-15 DOI: 10.1109/ICDMW.2008.116
G. Forestier, Cédric Wemmert, P. Gançarski
In this paper we present a new algorithm for semi-supervised clustering. We assume that a small set of labeled samples is available and use it in a clustering algorithm to discover relevant patterns. We study how our algorithm performs against two other semi-supervised algorithms when the data are multimodal. Then, we study the case where the user is able to produce a few samples for some classes but not for every class of the dataset. Indeed, in complex problems, the user is not always able to produce samples for each class present in the dataset. The challenge is consequently to use the set of labeled samples to discover other members of these classes while keeping a degree of freedom to discover unknown clusters, for which samples are not available. We address this problem through a series of experiments on synthetic datasets to show the relevance of the proposed method.
Citations: 3
Wavelet-Based Data Perturbation for Simultaneous Privacy-Preserving and Statistics-Preserving
Pub Date : 2008-12-15 DOI: 10.1109/ICDMW.2008.77
Lian Liu, Jie Wang, Jun Zhang
With the rapid development of data mining technologies, preserving privacy in certain data has become a challenge for data mining applications in many fields, especially the medical, financial and homeland security fields. We present a privacy-preserving strategy based on wavelet perturbation that maintains data privacy, statistical properties, and data mining utility at the same time. Our mathematical analyses and experimental results show that this method preserves distances before and after perturbation and preserves the basic statistical properties of the original data while maximizing data utility. Through experiments on real-life datasets, we conclude that this method is a promising privacy-preserving and statistics-preserving technique.
Citations: 44
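The abstract of "Wavelet-Based Data Perturbation" does not describe the perturbation scheme itself, so the following is only a generic sketch of the idea of perturbing data in a wavelet domain. It uses a single-level orthonormal Haar transform and adds Gaussian noise to the detail coefficients only, leaving the approximation coefficients untouched; that choice, the noise scale, and all function names are assumptions for illustration. One consequence of this particular choice is that the sum (and hence the mean) of the series is unchanged by the perturbation.

```python
import math
import random


def haar_forward(values):
    """Single-level orthonormal Haar transform of an even-length list."""
    s = math.sqrt(2.0)
    approx = [(values[i] + values[i + 1]) / s for i in range(0, len(values), 2)]
    detail = [(values[i] - values[i + 1]) / s for i in range(0, len(values), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Invert haar_forward, returning the reconstructed list."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out


def perturb(values, noise_scale, rng):
    """Add Gaussian noise to the detail coefficients only (an assumption)."""
    approx, detail = haar_forward(values)
    noisy = [d + rng.gauss(0.0, noise_scale) for d in detail]
    return haar_inverse(approx, noisy)


rng = random.Random(0)
original = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 3.0]  # made-up attribute values
perturbed = perturb(original, noise_scale=1.0, rng=rng)
```

Because each pair of output values sums to a multiple of its unchanged approximation coefficient, `sum(perturbed)` matches `sum(original)` up to floating-point error, while the individual values differ.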
Bounding and Estimating Association Rule Support from Clusters on Binary Data
Pub Date : 2008-12-15 DOI: 10.1109/ICDMW.2008.47
C. Ordonez, Kai Zhao, Zhibo Chen
The theoretical relationship between association rules and machine learning techniques needs to be studied in more depth. This article studies the use of clustering as a model for association rule mining. The clustering model is exploited to bound and estimate association rule support and confidence. We first study the efficient computation of the clustering model with K-means; we show that the sufficient statistic for clustering on binary data sets is the linear sum of points. We then prove that itemset support can be bounded and estimated from the model. Finally, we show that support bounds fulfill the set downward closure property. Experiments study model accuracy and algorithm speed, paying particular attention to error behavior in support estimation. Given a sufficiently large number of clusters, the model approximates support fairly accurately. However, as the minimum support threshold decreases, accuracy also decreases. The model is fairly accurate at discovering a large fraction of frequent itemsets at different support levels. The model is compared against a traditional association rule algorithm for mining frequent itemsets and exhibits better performance at low support levels. The time complexity of computing the binary cluster model is linear in data set size, whereas the dimensionality of transaction data sets has marginal impact on time.
Citations: 3
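The abstract of "Bounding and Estimating Association Rule Support from Clusters on Binary Data" gives no formulas, so the sketch below shows one elementary way a bound and an estimate can be derived from per-cluster sizes and linear sums. Within a cluster, the number of transactions containing every item of an itemset cannot exceed the smallest per-item count, which yields an upper bound; the estimate additionally assumes items are independent within a cluster. Both formulas, the fixed cluster assignments (which would normally come from K-means), and all names are assumptions and may differ from the paper's derivation.

```python
def cluster_stats(transactions, assignments, n_clusters, n_items):
    """Per-cluster size N_j and linear sum L_j (item counts) for binary data."""
    sizes = [0] * n_clusters
    sums = [[0] * n_items for _ in range(n_clusters)]
    for row, j in zip(transactions, assignments):
        sizes[j] += 1
        for i, bit in enumerate(row):
            sums[j][i] += bit
    return sizes, sums


def support_upper_bound(itemset, sizes, sums):
    """Upper bound: per cluster, at most min_i L_j[i] rows hold all items."""
    return sum(min(s[i] for i in itemset) for s in sums) / sum(sizes)


def support_estimate(itemset, sizes, sums):
    """Estimate assuming items are independent within each cluster."""
    est = 0.0
    for n, s in zip(sizes, sums):
        if n == 0:
            continue
        p = 1.0
        for i in itemset:
            p *= s[i] / n
        est += n * p
    return est / sum(sizes)


def true_support(itemset, transactions):
    """Exact support, computed by scanning all transactions."""
    hits = sum(all(row[i] for i in itemset) for row in transactions)
    return hits / len(transactions)


# Made-up binary transactions over three items and a hypothetical clustering.
transactions = [
    [1, 1, 0], [1, 1, 0], [1, 0, 0],
    [0, 1, 1], [0, 0, 1], [0, 1, 1],
]
assignments = [0, 0, 0, 1, 1, 1]
sizes, sums = cluster_stats(transactions, assignments, n_clusters=2, n_items=3)
```

Adding an item to an itemset can only lower each cluster's minimum, so this bound is non-increasing under itemset growth, which is the downward closure behavior the abstract mentions.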
Risk Assessment of Atmospheric Hazard Releases Using K-Means Clustering
Pub Date : 2008-12-15 DOI: 10.1109/ICDMW.2008.89
G. Cervone, P. Franzese, Y. Ezber, Z. Boybeyi
Unsupervised machine learning algorithms are used to perform statistical analysis of several transport and dispersion model runs that simulate emissions from a fixed source under different atmospheric conditions. A clustering algorithm automatically groups the results of the transport and dispersion simulations according to their respective cloud characteristics. Each cluster of clouds describes a distinct area at risk from potentially hazardous atmospheric contamination. By superimposing the resulting risk areas on ground maps, it is possible to assess the impact of population exposure to the contaminants. The releases were simulated in the Bosphorus channel. Simulations were performed for one year at weekly intervals, both day and night, to sample the different potential atmospheric conditions.
Citations: 8
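The grouping step named in the title "Risk Assessment of Atmospheric Hazard Releases Using K-Means Clustering" can be sketched with plain Lloyd's k-means. The abstract does not say which cloud characteristics were clustered, so the two-number descriptor used here (a hypothetical east and north offset of the cloud centre from the source, in km), the sample values, and the initial centroids are all made up for illustration.

```python
def kmeans(points, centroids, iterations=20):
    """Plain Lloyd's k-means on tuples of floats with given initial centroids."""
    centroids = [tuple(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(len(centroids)),
                key=lambda j: sum((p - c) ** 2 for p, c in zip(pt, centroids[j])))
            for pt in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for j in range(len(centroids)):
            members = [pt for pt, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Hypothetical cloud descriptors from six simulated releases.
clouds = [
    (1.0, 5.0), (1.5, 5.5), (0.8, 4.6),
    (-4.0, -1.0), (-4.5, -0.5), (-3.8, -1.4),
]
labels, centroids = kmeans(clouds, centroids=[clouds[0], clouds[3]])
```

Each resulting cluster groups simulations whose clouds travel in a similar direction; in the setting the abstract describes, the area covered by a cluster's clouds would then be overlaid on a ground map.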
Semantic Features for Multi-view Semi-supervised and Active Learning of Text Classification
Pub Date : 2008-12-15 DOI: 10.1109/ICDMW.2008.13
Shiliang Sun
For multi-view learning, existing methods usually train classifiers on the originally provided features, which ignores the latent correlation between different views. In this paper, semantic features integrating information from multiple views are extracted for pattern representation. Canonical correlation analysis is used to learn the representation of semantic spaces, where semantic features are projections of the original features onto the basis vectors of the spaces. We investigate the feasibility of semantic features in two learning paradigms: semi-supervised learning and active learning. Experiments on text classification with two state-of-the-art multi-view learning algorithms, co-training and co-testing, indicate that this use of semantic features can lead to a significant improvement in performance.
Citations: 10
Mining Unstructured Text at Gigabyte per Second Speeds
Pub Date : 2008-12-15 DOI: 10.1109/ICDMW.2008.9
A. Ratner
Humans communicate with text in thousands of languages, in dozens of scripts, in a variety of binary codes, on millions of topics. There is a need, for both government and commercial applications, to identify these text characteristics to enable follow-on processing such as transcoding, translation, transliteration, routing and prioritization. This paper deals with the implementation of real-time mining of unstructured text on high-speed hardware capable of processing network data streams at gigabyte per second speeds.
Citations: 0
An Adaptive Pre-filtering Technique for Error-Reduction Sampling in Active Learning
Pub Date : 2008-12-15 DOI: 10.1109/ICDMW.2008.52
Michael Davy, S. Luz
Error-reduction sampling (ERS) is a high-performing (but computationally expensive) query selection strategy for active learning. Subset optimisation has been proposed to reduce computational expense by applying ERS to only a subset of examples from the pool. This paper compares techniques used to construct the subset, namely random sub-sampling and pre-filtering. We focus on pre-filtering, which populates the subset with more informative examples by filtering the unlabelled pool using a query selection strategy. In this paper we establish whether pre-filtering outperforms sub-sampling optimisation, examine the effect of subset size, and propose a novel adaptive pre-filtering technique that dynamically switches between several alternative pre-filtering techniques using a multi-armed bandit algorithm. Empirical evaluations conducted on two benchmark text categorisation datasets demonstrate that pre-filtered ERS achieves higher levels of accuracy than sub-sampled ERS. The proposed adaptive pre-filtering technique is also shown to be competitive with the optimal pre-filtering technique on the majority of problems and is never the worst technique.
Citations: 4
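The abstract of "An Adaptive Pre-filtering Technique for Error-Reduction Sampling in Active Learning" says a multi-armed bandit switches between pre-filtering techniques but does not name the bandit algorithm. As an illustration only, the sketch below uses epsilon-greedy, one common bandit strategy, to choose among three hypothetical pre-filtering strategies. The strategy names, the simulated reward probabilities, and the reward definition (a stand-in for whatever improvement signal the real system would observe) are all assumptions.

```python
import random


class EpsilonGreedyBandit:
    """Epsilon-greedy bandit over named arms, tracking mean reward per arm."""

    def __init__(self, arms, epsilon, rng):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.rng = rng
        self.counts = {a: 0 for a in self.arms}
        self.means = {a: 0.0 for a in self.arms}

    def select(self):
        # Try every arm once, then explore with probability epsilon.
        untried = [a for a in self.arms if self.counts[a] == 0]
        if untried:
            return untried[0]
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)
        return max(self.arms, key=lambda a: self.means[a])

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental update of the running mean reward.
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


# Hypothetical pre-filtering strategies with simulated success probabilities.
rng = random.Random(42)
true_gain = {"uncertainty": 0.6, "density": 0.4, "random": 0.2}
bandit = EpsilonGreedyBandit(true_gain, epsilon=0.1, rng=rng)
for _ in range(500):
    arm = bandit.select()
    reward = 1.0 if rng.random() < true_gain[arm] else 0.0
    bandit.update(arm, reward)
```

In an active learning loop, `select()` would pick the pre-filter used to build the candidate subset for the next query, and `update()` would be called once the effect of that query on the classifier has been measured.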
High Granularity Remote Sensing and Crop Production over Space and Time: NDVI over the Growing Season and Prediction of Cotton Yields at the Farm Field Level in Texas
Pub Date : 2008-12-15 DOI: 10.1109/ICDMW.2008.91
B. Little, M. Schucking, B. Gartrell, Bing Chen, K. Ross, R. McKellip
Remote sensing has been applied to agriculture at very coarse levels of granularity (i.e., national levels), but few investigations have focused on yield prediction at the farm unit level. The specific aim of the present investigation is to analyze the ability of Moderate Resolution Imaging Spectroradiometer (MODIS) data to predict cotton yields in two highly homogeneous counties in west Texas. In one study county, >90% of the cotton grown is irrigated, while the other study county, 40 miles south, has >85% non-irrigated cotton. Regression analysis by day from April to November at the county and farm levels reveals a highly significant ability for MODIS to predict cotton yields. R values ranged from 0.90 to 0.98 for irrigated cotton and 0.80 to 0.90 for non-irrigated cotton practices. The objective in future studies is to algorithmically extend these analyses to the ~300 million acres of arable land under cultivation in the United States.
Citations: 6
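The NDVI named in the title of the cotton-yield paper is conventionally computed as (NIR - Red) / (NIR + Red) from near-infrared and red reflectance. The sketch below computes it for a few fields and then the Pearson correlation between NDVI and yield, as a simplified stand-in for the per-day regression analysis the abstract reports. The reflectance values and yields are invented, the yield unit is left unspecified, and the paper's actual regression setup is not described in the abstract.

```python
import math


def ndvi(nir, red):
    """Normalized Difference Vegetation Index from NIR and red reflectance."""
    return (nir - red) / (nir + red)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


# Made-up (NIR, red) reflectances and yields for five hypothetical fields.
reflectances = [(0.50, 0.10), (0.45, 0.15), (0.40, 0.20), (0.55, 0.08), (0.35, 0.25)]
yields = [900.0, 700.0, 450.0, 1000.0, 300.0]

field_ndvi = [ndvi(nir, red) for nir, red in reflectances]
r = pearson_r(field_ndvi, yields)
```

Repeating the correlation for each observation day of the growing season would give a curve of R values over time, which is the kind of result the abstract summarizes as ranges for irrigated and non-irrigated cotton.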