In this paper we present two algorithms for shape recognition. Both algorithms map the contour of the shape to be recognized into a string of symbols. The first algorithm is based on supervised learning using string kernels, as commonly used for text categorization and classification. The second algorithm is very weakly supervised and is based on Procrustes analysis and on the edit distance for computing the similarity between strings of symbols. The second algorithm correctly recognizes 98.29% of the shapes from the MPEG-7 database, better than any previously reported algorithm, and is also able to retrieve similar shapes from a database.
{"title":"Shape Recognition and Retrieval Using String of Symbols","authors":"M. Daliri, V. Torre","doi":"10.1109/ICMLA.2006.48","DOIUrl":"https://doi.org/10.1109/ICMLA.2006.48","url":null,"abstract":"In this paper we present two algorithms for shape recognition. Both algorithms map the contour of the shape to be recognized into a string of symbols. The first algorithm is based on supervised learning using string kernels as often used for text categorization and classification. The second algorithm is very weakly supervised and is based on the procrustes analysis and on the edit distance used for computing the similarity between strings of symbols. The second algorithm correctly recognizes 98.29% of shapes from the MPEG-7 database, i.e. better than any previous algorithms. The second algorithm is able also to retrieve similar shapes from a database","PeriodicalId":297071,"journal":{"name":"2006 5th International Conference on Machine Learning and Applications (ICMLA'06)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126841543","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Designing antisense oligonucleotides with high efficacy is of great interest, both for its usefulness in the study of gene regulation and for its potential therapeutic effects. The high cost associated with experimental approaches has motivated the development of computational methods to assist in their design. Essentially, these computational methods rely on various sequential and structural features to differentiate high-efficacy antisense oligonucleotides from low-efficacy ones. So far, however, most of the features used are local motifs present in either primary sequences or secondary structures. We propose a novel approach to profiling antisense oligonucleotides and the target RNA so as to reflect some of the global structural features, such as hairpin structures. Such profiles are then used for classification and prediction of high-efficacy oligonucleotides with support vector machines. The method was tested on a set of 348 antisense oligonucleotides for 19 RNA targets with known activity, and the performance was evaluated by cross-validation and ROC scores. The results show that the prediction accuracy is significantly enhanced.
{"title":"Prediction of Antisense Oligonucleotide Efficacy Using Local and Global Structure Information with Support Vector Machines","authors":"R. Craig, Li Liao","doi":"10.1109/ICMLA.2006.39","DOIUrl":"https://doi.org/10.1109/ICMLA.2006.39","url":null,"abstract":"Designing antisense oligonucleotides with high efficacy is of great interest both for its usefulness to the study of gene regulation and for its potential therapeutic effects. The high cost associated with experimental approaches has motivated the development of computational methods to assist in their design. Essentially, these computational methods rely on various sequential and structural features to differentiate the high efficacy antisense oligonucleotides from the low efficacy. By far, however, most of the features used are either local motifs present in primary sequences or in secondary structures. We proposed a novel approach to profiling antisense oligonucleotides and the target RNA to reflect some of the global structural features such as hairpin structures. Such profiles are then utilized for classification and prediction of high efficacy oligonucleotides using support vector machines. The method was tested on a set of 348 antisense oligonucleotides of 19 RNA targets with known activity. The performance was evaluated by cross validation and ROC scores. It was shown that the prediction accuracy was significantly enhanced","PeriodicalId":297071,"journal":{"name":"2006 5th International Conference on Machine Learning and Applications (ICMLA'06)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116485268","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We introduce regression databases (REDB) to formalize and automate probabilistic querying using sparse learning sets. The REDB data model involves observation data, learning set data, view definitions, and a regression model instance. The observation data is a collection of relational tuples over a set of attributes; the learning data set involves a subset of observation tuples, augmented with learned attributes, which are modeled as random variables; the views are expressed as linear combinations of observation and learned attributes; and the regression model involves functions that map observation tuples to probability distributions of the random variables, which are learned dynamically from the learning data set. The REDB query language extends relational algebra project-select queries with conditions on probabilities of first-order logical expressions, which in turn involve linear combinations of learned attributes and views, and arithmetic comparison operators. This capability relies on the underlying regression model for the learned attributes. We show that REDB queries are computable by developing conceptual evaluation algorithms and by proving their correctness and termination.
{"title":"Regression Databases: Probabilistic Querying Using Sparse Learning Sets","authors":"A. Brodsky, C. Domeniconi, David Etter","doi":"10.1109/ICMLA.2006.44","DOIUrl":"https://doi.org/10.1109/ICMLA.2006.44","url":null,"abstract":"We introduce regression databases (REDB) to formalize and automate probabilistic querying using sparse learning sets. The REDB data model involves observation data, learning set data, views definitions, and a regression model instance. The observation data is a collection of relational tuples over a set of attributes; the learning data set involves a subset of observation tuples, augmented with learned attributes, which are modeled as random variables; the views are expressed as linear combinations of observation and learned attributes; and the regression model involves functions that map observation tuples to probability distributions of the random variables, which are learned dynamically from the learning data set. The REDB query language extends relational algebra project-select queries with conditions on probabilities of first-order logical expressions, which in turn involve linear combinations of learned attributes and views, and arithmetic comparison operators. Such capability relies on the underlying regression model for the learned attributes. We show that REDB queries are computable by developing conceptual evaluation algorithms and by proving their correctness and termination","PeriodicalId":297071,"journal":{"name":"2006 5th International Conference on Machine Learning and Applications (ICMLA'06)","volume":"187 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116144918","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We introduce a novel application of support vector machines (SVMs) to the problem of identifying potential supernovae using photometric and geometric features computed from astronomical imagery. The challenges of this supervised learning application are significant: 1) noisy and corrupt imagery resulting in high levels of feature uncertainty, 2) features with heavy-tailed, peaked distributions, 3) extremely imbalanced and overlapping positive and negative data sets, and 4) the need to reach high positive classification rates, i.e., to find all potential supernovae, while reducing the burdensome workload of manually examining false positives. High accuracy is achieved via a sign-preserving, shifted log transform applied to features with peaked, heavy-tailed distributions. The imbalanced data problem is handled by oversampling positive examples, selectively sampling misclassified negative examples, and iteratively training multiple SVMs for improved supernova recognition on unseen test data. We present cross-validation results and demonstrate the impact on a large-scale supernova survey that currently uses the SVM decision value to rank-order 600,000 potential supernovae each night.
{"title":"Supernova Recognition Using Support Vector Machines","authors":"R. Romano, C. Aragon, C. Ding","doi":"10.1109/ICMLA.2006.49","DOIUrl":"https://doi.org/10.1109/ICMLA.2006.49","url":null,"abstract":"We introduce a novel application of support vector machines (SVMs) to the problem of identifying potential supernovae using photometric and geometric features computed from astronomical imagery. The challenges of this supervised learning application are significant: 1) noisy and corrupt imagery resulting in high levels of feature uncertainty, 2) features with heavy-tailed, peaked distributions, 3) extremely imbalanced and overlapping positive and negative data sets, and 4) the need to reach high positive classification rates, i.e. to find all potential supernovae, while reducing the burdensome workload of manually examining false positives. High accuracy is achieved via a sign-preserving, shifted log transform applied to features with peaked, heavy-tailed distributions. The imbalanced data problem is handled by oversampling positive examples, selectively sampling misclassified negative examples, and iteratively training multiple SVMs for improved supernova recognition on unseen test data. We present cross-validation results and demonstrate the impact on a large-scale supernova survey that currently uses the SVM decision value to rank-order 600,000 potential supernovae each night","PeriodicalId":297071,"journal":{"name":"2006 5th International Conference on Machine Learning and Applications (ICMLA'06)","volume":"169 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128615481","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The accuracy of the rules produced by a concept learning system can be hindered by the presence of errors in the data, such as "ill-defined" attributes that are too general or too specific for the concept to be learned. In this paper, we devise a method that uses the Boolean differences computed by a program called Newton to identify multiple ill-defined attributes in a dataset in a single pass. The method is based on a compound heuristic that assigns a real-valued rank to each possible hypothesis based on its key characteristics. We show by extensive empirical testing on randomly generated classifiers that the hypothesis with the highest rank is the correct one with an observed probability quickly converging to 100%. Moreover, the monotonicity of the ranking function enables us to use it as a rough estimator of its own likelihood.
{"title":"An Efficient Heuristic for Discovering Multiple Ill-Defined Attributes in Datasets","authors":"Sylvain Hallé","doi":"10.1109/ICMLA.2006.14","DOIUrl":"https://doi.org/10.1109/ICMLA.2006.14","url":null,"abstract":"The accuracy of the rules produced by a concept learning system can be hindered by the presence of errors in the data, such as \"ill-defined\" attributes that are too general or too specific for the concept to learn. In this paper, we devise a method that uses the Boolean differences computed by a program called Newton to identify multiple ill-defined attributes in a dataset in a single pass. The method is based on a compound heuristic that assigns a real-valued rank to each possible hypothesis based on its key characteristics. We show by extensive empirical testing on randomly generated classifiers that the hypothesis with the highest rank is the correct one with an observed probability quickly converging to 100%. Moreover, the monotonicity of the function enables us to use it as a rough estimator of its own likelihood","PeriodicalId":297071,"journal":{"name":"2006 5th International Conference on Machine Learning and Applications (ICMLA'06)","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114667052","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automatic signature verification is an active area of research with numerous applications such as bank check verification and ATM access. In this research, a kernel principal component self-regression (KPCSR) model is proposed for offline signature verification and recognition. Developed from kernel principal component regression (KPCR), the self-regression model selects a subset of the principal components from the kernel space for the input variables to accurately characterize each user's signature, thus offering good verification and recognition performance. In preliminary experiments the model works directly on bitmap images, showing satisfactory performance. A modular scheme with a subject-specific KPCSR structure proves very efficient, in which each user is assigned an independent KPCSR model for coding the corresponding visual information. Experimental results obtained on public benchmarking signature databases demonstrate the superiority of the proposed method.
{"title":"Off-Line Signature Recognition and Verification by Kernel Principal Component Self-Regression","authors":"Bai-ling Zhang","doi":"10.1109/ICMLA.2006.37","DOIUrl":"https://doi.org/10.1109/ICMLA.2006.37","url":null,"abstract":"Automatic signature verification is an active area of research with numerous applications such as bank check verification, ATM access, etc. In this research, a kernel principal component self-regression (KPCSR) model is proposed for offline signature verification and recognition problems. Developed from the kernel principal component regression (KPCR), the self-regression model selects a subset of the principal components from the kernel space for the input variables to accurately characterize each user's signature, thus offering good verification and recognition performance. The model directly works on bitmap images in the preliminary experiments, showing satisfactory performance. A modular scheme with subject-specific KPCSR structure proves very efficient, from which each user is assigned an independent KPCSR model for coding the corresponding visual information. Experimental results obtained on public benchmarking signature databases demonstrate the superiority of the proposed method","PeriodicalId":297071,"journal":{"name":"2006 5th International Conference on Machine Learning and Applications (ICMLA'06)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133225878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Intelligent devices with smart clutter management capabilities can enhance a user's situational awareness under adverse conditions. Two approaches to assist a user with target detection and clutter analysis are presented, and suggestions on how these tools could be integrated with an electronic chart system are detailed. The first tool, which can assist a user in finding a target partially obscured by display clutter, is a multiple-view generalization of AdaBoost. The second technique determines a meaningful measure of clutter in electronic displays by clustering features in both geospatial and color space. The clutter metric correlates with preliminary subjective clutter ratings, so the user can be warned if display clutter is a potential hazard to performance. Synthetic and real data sets are used to evaluate the performance of the proposed technique against recent classifier fusion strategies.
{"title":"Intelligent Electronic Navigational Aids: A New Approach","authors":"C. Barbu, M. Lohrenz, G. Layne","doi":"10.1109/ICMLA.2006.30","DOIUrl":"https://doi.org/10.1109/ICMLA.2006.30","url":null,"abstract":"Intelligent devices, with smart clutter management capabilities, can enhance a user's situational awareness under adverse conditions. Two approaches to assist a user with target detection and clutter analysis are presented, and suggestions on how these tools could be integrated with an electronic chart system are further detailed. The first tool, which can assist a user in finding a target partially obscured by display clutter, is a multiple-view generalization of AdaBoost. The second technique determines a meaningful measure of clutter in electronic displays by clustering features in both geospatial and color space. The clutter metric correlates with preliminary, subjective, clutter ratings. The user can be warned if display clutter is a potential hazard to performance. Synthetic and real data sets are used for performance evaluation of the proposed technique compared with recent classifier fusion strategies","PeriodicalId":297071,"journal":{"name":"2006 5th International Conference on Machine Learning and Applications (ICMLA'06)","volume":"25 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131924931","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The computational cost of nearest neighbor classification often prevents the method from being applied in practice when dealing with high-dimensional data, such as images and microarrays. One possible solution to this problem is to reduce the dimensionality of the data, ideally without losing predictive performance. Two different dimensionality reduction methods, principal component analysis (PCA) and random projection (RP), are investigated for this purpose and compared with respect to the performance of the resulting nearest neighbor classifier on five image data sets and five microarray data sets. The experimental results demonstrate that PCA outperforms RP for all data sets used in this study. However, the experiments also show that PCA is more sensitive to the choice of the number of reduced dimensions: after reaching a peak, the accuracy of PCA degrades as the number of dimensions grows, while the accuracy of RP increases with the number of dimensions. The experiments also show that using PCA or RP may even outperform using the non-reduced feature set (in 9 and 6 cases out of 10, respectively), hence resulting in not only more efficient but also more effective nearest neighbor classification.
{"title":"Reducing High-Dimensional Data by Principal Component Analysis vs. Random Projection for Nearest Neighbor Classification","authors":"Sampath Deegalla, Henrik Boström","doi":"10.1109/ICMLA.2006.43","DOIUrl":"https://doi.org/10.1109/ICMLA.2006.43","url":null,"abstract":"The computational cost of using nearest neighbor classification often prevents the method from being applied in practice when dealing with high-dimensional data, such as images and micro arrays. One possible solution to this problem is to reduce the dimensionality of the data, ideally without loosing predictive performance. Two different dimensionality reduction methods, principle component analysis (PCA) and random projection (RP), are investigated for this purpose and compared w.r.t. the performance of the resulting nearest neighbor classifier on five image data sets and five micro array data sets. The experiment results demonstrate that PCA outperforms RP for all data sets used in this study. However, the experiments also show that PCA is more sensitive to the choice of the number of reduced dimensions. After reaching a peak, the accuracy degrades with the number of dimensions for PCA, while the accuracy for RP increases with the number of dimensions. The experiments also show that the use of PCA and RP may even outperform using the non-reduced feature set (in 9 respectively 6 cases out of 10), hence not only resulting in more efficient, but also more effective, nearest neighbor classification","PeriodicalId":297071,"journal":{"name":"2006 5th International Conference on Machine Learning and Applications (ICMLA'06)","volume":"92 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132115579","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
When performing predictive modeling, the key criterion is always accuracy. With this in mind, complex techniques like neural networks or ensembles are normally used, resulting in opaque models that are impossible to interpret. When models need to be comprehensible, accuracy is often sacrificed by using simpler techniques that directly produce transparent models; a tradeoff termed the accuracy vs. comprehensibility tradeoff. To reduce this tradeoff, the opaque model can be transformed into another, interpretable, model; an activity termed rule extraction. In this paper, it is argued that rule extraction algorithms should gain from using oracle data, i.e., test set instances together with the corresponding predictions from the opaque model. The experiments, using 17 publicly available data sets, clearly show that rules extracted using only oracle data were significantly more accurate than both rules extracted by the same algorithm using training data and standard decision tree algorithms. In addition, the same rules were also significantly more compact, thus providing better comprehensibility. The overall implication is that rules extracted in this fashion explain the predictions made on novel data better than rules extracted in the standard way, i.e., using training data only.
{"title":"Rule Extraction from Opaque Models-- A Slightly Different Perspective","authors":"U. Johansson, Tuwe Löfström, Rikard König, Cecilia Sönströd, L. Niklasson","doi":"10.1109/ICMLA.2006.46","DOIUrl":"https://doi.org/10.1109/ICMLA.2006.46","url":null,"abstract":"When performing predictive modeling, the key criterion is always accuracy. With this in mind, complex techniques like neural networks or ensembles are normally used, resulting in opaque models impossible to interpret. When models need to be comprehensible, accuracy is often sacrificed by using simpler techniques directly producing transparent models; a tradeoff termed the accuracy vs. comprehensibility tradeoff. In order to reduce this tradeoff, the opaque model can be transformed into another, interpretable, model; an activity termed rule extraction. In this paper, it is argued that rule extraction algorithms should gain from using oracle data; i.e. test set instances, together with corresponding predictions from the opaque model. The experiments, using 17 publicly available data sets, clearly show that rules extracted using only oracle data were significantly more accurate than both rules extracted by the same algorithm, using training data, and standard decision tree algorithms. In addition, the same rules were also significantly more compact; thus providing better comprehensibility. The overall implication is that rules extracted in this fashion explain the predictions made on novel data better than rules extracted in the standard way; i.e. using training data only","PeriodicalId":297071,"journal":{"name":"2006 5th International Conference on Machine Learning and Applications (ICMLA'06)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128392051","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Increasingly powerful computing technology makes it possible to investigate the information hidden in huge collections of documents. In this report, we are especially interested in documents with a relative time order, which we call document streams. Examples include TV news, forums, emails of company projects, and call center telephone logs. To gain insight into these document streams, we first need to detect the events they contain. We use a time-sensitive Dirichlet process mixture model to find the events in the document streams. A time-sensitive Dirichlet process mixture model is a generative model that allows a potentially infinite number of mixture components and uses a Dirichlet compound multinomial model for the distribution of words in documents. In this report, we consider three different time-sensitive Dirichlet process mixture models: an exponential decay kernel model, a polynomial decay kernel model, and a sliding window kernel model. Experiments on the TDT2 dataset show that the time-sensitive models perform 18-20% better in terms of accuracy than the standard Dirichlet process mixture model, and the sliding window kernel and the polynomial kernel are the more promising for detecting events. We use ThemeRiver to provide a visualization of the events along the time axis; with its help, people can easily get an overall picture of how different events evolve. Besides ThemeRiver, we investigate using top words as a high-level summary of each event. Experimental results on the TDT2 dataset suggest that the sliding window kernel is the better choice, both in capturing the trend of the events and in expressibility.
{"title":"Trend Analysis for Large Document Streams","authors":"Chengliang Zhang, Shenghuo Zhu, Yihong Gong","doi":"10.1109/ICMLA.2006.51","DOIUrl":"https://doi.org/10.1109/ICMLA.2006.51","url":null,"abstract":"More and more powerful computer technology inspires people to investigate information hidden under huge amounts of documents. In this report, we are especially interested in documents with relative time order, which we also call document streams. Examples include TV news, forums, emails of company projects, call center telephone logs, etc. To get an insight into these document streams, first we need to detect the events among the document streams. We use a time-sensitive Dirichlet process mixture model to find the events in the document streams. A time sensitive Dirichlet process mixture model is a generative model, which allows a potentially infinite number of mixture components and uses a Dirichlet compound multinomial model to model the distribution of words in documents. In this report, we consider three different time sensitive Dirichlet process mixture models: an exponential decay kernel model, a polynomial decay function kernel Dirichlet process model and a sliding window kernel model. Experiments on the TDT2 dataset have shown that the time sensitive models perform 18-20% better in terms of accuracy than the Dirichlet process mixture model. The sliding windows kernel and the polynomial kernel are more promising in detecting events. We use ThemeRiver to provide a visualization of the events along the time axis. With the help of ThemeRiver, people can easily get an overall picture of how different events evolve. Besides ThemeRiver, we investigate using top words as a high-level summarization of each event. Experiment results on TDT2 dataset suggests that the sliding window kernel is a better choice both in terms of capturing the trend of the events and expressibility","PeriodicalId":297071,"journal":{"name":"2006 5th International Conference on Machine Learning and Applications (ICMLA'06)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123325848","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}