
2008 IEEE International Conference on Data Mining Workshops: Latest Papers

Domain Driven Data Mining (D3M)
Pub Date : 2008-12-15 DOI: 10.1109/ICDMW.2008.98
Longbing Cao
When deploying data mining in real-world business, we have to cater for business scenarios, organizational factors, user preferences and business needs. However, current data mining algorithms and tools often stop at delivering patterns that satisfy expected technical interestingness. Business people are not told what to do with the technical deliverables or how to act on them. This gap between academia and business has seriously hindered the widespread use of advanced data mining techniques to improve enterprise operational quality and productivity. To narrow the gap, cater for real-world factors relevant to data mining, and make data mining workable in supporting decision-making actions in the real world, we propose the methodology of domain driven data mining (D3M for short). D3M aims to construct next-generation methodologies, techniques and tools for a possible paradigm shift from data-centered hidden pattern mining to domain-driven actionable knowledge delivery. In this talk, we address the concept map of D3M, its theoretical underpinnings, several general and flexible frameworks, research issues, possible directions and application areas. Real-world case studies in financial data mining and social security mining demonstrate the effectiveness and applicability of D3M in both research and development for challenging real-world problems.
Citations: 24
Co-training by Committee: A New Semi-supervised Learning Framework
Pub Date : 2008-12-15 DOI: 10.1109/ICDMW.2008.27
Mohamed Farouk Abdel Hady, F. Schwenker
For many data mining applications, it is necessary to develop algorithms that use unlabeled data to improve the accuracy of supervised learning. Co-Training is a popular semi-supervised learning algorithm. It assumes that each example is represented by two or more redundantly sufficient sets of features (views) and that these views are independent given the class. However, these assumptions are not satisfied in many real-world application domains. Therefore, we present a framework called co-training by committee (CoBC), in which a set of diverse classifiers learn from each other. The framework is a simple, general single-view semi-supervised learner that can use any ensemble learner to build diverse committees. Experimental studies on CoBC using bagging, AdaBoost and the random subspace method (RSM) as ensemble learners demonstrate that error diversity among classifiers leads to an effective co-training that requires neither redundant and independent views nor different learning algorithms.
Citations: 47
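The committee idea in the CoBC abstract above can be illustrated with a small sketch. This is only one reading of the abstract, not the authors' algorithm: the base learner (a 1-D nearest-class-mean classifier), the stratified bootstrap used to build a bagged committee, and all function names (`cobc_sketch`, `committee_predict`, and so on) are assumptions made for illustration. In each round, every member receives the unlabeled example that its companion members agree on most strongly.

```python
import random
from collections import Counter


def train_centroids(samples):
    """Hypothetical base learner: one mean per class over (value, label) pairs."""
    sums, counts = Counter(), Counter()
    for x, y in samples:
        sums[y] += x
        counts[y] += 1
    return {y: sums[y] / counts[y] for y in counts}


def predict(model, x):
    """Return the class whose mean is closest to x."""
    return min(model, key=lambda y: abs(x - model[y]))


def stratified_bootstrap(samples, rng):
    """Resample with replacement inside each class so no class disappears."""
    by_class = {}
    for x, y in samples:
        by_class.setdefault(y, []).append((x, y))
    return [rng.choice(group) for group in by_class.values() for _ in group]


def cobc_sketch(labeled, unlabeled, n_members=3, rounds=10, seed=0):
    """Committee-based co-training loop (illustrative assumption, not the paper's code)."""
    if n_members < 2:
        raise ValueError("each member needs at least one companion")
    rng = random.Random(seed)
    train_sets = [stratified_bootstrap(labeled, rng) for _ in range(n_members)]
    pool = list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        models = [train_centroids(s) for s in train_sets]
        taken = set()
        for i in range(n_members):
            # The companion committee is every member except member i.
            companions = [m for j, m in enumerate(models) if j != i]
            best = None
            for idx, x in enumerate(pool):
                if idx in taken:
                    continue
                votes = Counter(predict(m, x) for m in companions)
                label, count = votes.most_common(1)[0]
                if best is None or count > best[0]:
                    best = (count, idx, label)
            if best is not None:
                # Member i learns from the label its companions agreed on.
                train_sets[i].append((pool[best[1]], best[2]))
                taken.add(best[1])
        pool = [x for idx, x in enumerate(pool) if idx not in taken]
    return [train_centroids(s) for s in train_sets]


def committee_predict(models, x):
    """Majority vote of the final committee."""
    return Counter(predict(m, x) for m in models).most_common(1)[0][0]
```

Any ensemble method could replace the bootstrap step; bagging is used here only because the abstract lists it among the ensemble learners that were evaluated.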
A New Graph-Based Algorithm for Clustering Documents
Pub Date : 2008-12-15 DOI: 10.1109/ICDMW.2008.69
Airel Pérez Suárez, José Francisco Martínez Trinidad, J. A. Carrasco-Ochoa, J. Medina-Pagola
This paper presents CStar, a new algorithm for document clustering. It improves on recently developed algorithms such as generalized star (GStar) and ACONS, which were originally proposed to reduce drawbacks of earlier Star-like algorithms. The CStar algorithm uses the condensed star-shaped sub-graph concept defined by ACONS, but defines a new heuristic that makes it possible to construct a new cover of the thresholded similarity graph and to reduce the drawbacks of the GStar and ACONS algorithms. Experiments on standard document collections show that our proposal outperforms previously defined algorithms and other related algorithms used for document clustering.
Citations: 6
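The abstract above builds on covering a thresholded similarity graph with star-shaped sub-graphs. The sketch below shows a plain greedy star cover of the kind used by the earlier Star-like algorithms; it is not CStar, GStar or ACONS, whose heuristics the abstract does not spell out. The similarity matrix `sim`, the `threshold` parameter and the function name `star_cover` are assumptions for illustration.

```python
def star_cover(sim, threshold):
    """Greedy star cover of the thresholded similarity graph.

    sim is a symmetric matrix of pairwise document similarities.
    Returns a list of (center, members) pairs; clusters may overlap.
    """
    n = len(sim)
    # Keep only edges whose similarity reaches the threshold.
    adj = {
        i: {j for j in range(n) if j != i and sim[i][j] >= threshold}
        for i in range(n)
    }
    covered = set()
    clusters = []
    # Visit vertices by decreasing degree, breaking ties by index.
    for v in sorted(range(n), key=lambda i: (-len(adj[i]), i)):
        if v in covered:
            continue  # already a satellite of an earlier star
        star = {v} | adj[v]
        clusters.append((v, sorted(star)))
        covered |= star
    return clusters
```

A vertex becomes a star center only if no earlier star has covered it, so every document ends up in at least one cluster.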
Hunting for Coherent Co-clusters in High Dimensional and Noisy Datasets
Pub Date : 2008-12-15 DOI: 10.1109/ICDMW.2008.20
Meghana Deodhar, Joydeep Ghosh, Gunjan Gupta, Hyuk Cho, I. Dhillon
Clustering problems often involve datasets where only a part of the data is relevant to the problem, e.g., in microarray data analysis only a subset of the genes show cohesive expressions within a subset of the conditions/features. The existence of a large number of non-informative data points and features makes it challenging to hunt for coherent and meaningful clusters from such datasets. Additionally, since clusters could exist in different subspaces of the feature space, a co-clustering algorithm that simultaneously clusters objects and features is often more suitable as compared to one that is restricted to traditional "one-sided" clustering. We propose Robust Overlapping Co-clustering (ROCC), a scalable and very versatile framework that addresses the problem of efficiently mining dense, arbitrarily positioned, possibly overlapping co-clusters from large, noisy datasets. ROCC has several desirable properties that make it extremely well suited to a number of real life applications. Through extensive experimentation we show that our approach is significantly more accurate in identifying biologically meaningful co-clusters in microarray data as compared to several other prominent approaches that have been applied to this task. We also point out other interesting applications of the proposed framework in solving difficult clustering problems.
Citations: 3
Actionable Knowledge Discovery for Threats Intelligence Support Using a Multi-dimensional Data Mining Methodology
Pub Date : 2008-12-15 DOI: 10.1109/ICDMW.2008.78
Olivier Thonnard, M. Dacier
This paper describes a multi-dimensional knowledge discovery and data mining (KDD) methodology that aims at discovering actionable knowledge related to Internet threats, taking into account domain expert guidance and the integration of domain-specific intelligence during the data mining process. The objectives are twofold: i) to develop global indicators for assessing the prevalence of certain malicious activities on the Internet, and ii) to get insights into the modus operandi of new emerging attack phenomena, so as to improve our understanding of threats. In this paper, we first present the generic aspects of a domain-driven graph-based KDD methodology, which is based on two main components: a clique-based clustering technique and a concepts synthesis process using cliques' intersections. Then, to evaluate the applicability of this approach to our application domain, we use a large dataset of real-world attack traces collected since 2003. Our experimental results show that significant insights can be obtained into the domain of threat intelligence by using this multi-dimensional knowledge discovery method.
Citations: 23
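The two components named in the abstract above, clique-based clustering and synthesis by clique intersection, can be sketched as follows. This is an assumed reading rather than the authors' method: maximal cliques are enumerated with the basic Bron-Kerbosch recursion (no pivoting), and cliques found in two different graphs (one per analysis dimension) are intersected. The function names and the `min_size` parameter are hypothetical.

```python
def maximal_cliques(adj):
    """Enumerate maximal cliques of an undirected graph.

    adj maps each vertex to the set of its neighbours.
    """
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates, x: already explored vertices.
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in sorted(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def intersect_cliques(cliques_a, cliques_b, min_size=2):
    """Keep groups that hang together in both dimensions."""
    result = set()
    for a in cliques_a:
        for b in cliques_b:
            common = a & b
            if len(common) >= min_size:
                result.add(common)
    return result
```

Each graph would be built from one feature of the attack traces, with an edge joining two traces that look similar along that feature; the intersections then point at traces that are related along several features at once.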
Web Query Prediction by Unifying Model
Pub Date : 2008-12-15 DOI: 10.1109/ICDMW.2008.53
Ning Liu, Jun Yan, Shuicheng Yan, Weiguo Fan, Zheng Chen
Recently, many commercial products, such as Google Trends and Yahoo! Buzz, have been released to monitor past trends in search engine query frequency. However, little research has been devoted to predicting the upcoming query trend, which is of great importance in providing guidelines for future business planning. In this paper, a unified solution is presented for this purpose. Besides the classical time series model, we propose to integrate the cosine signal hidden periodicities model to capture the periodic information of query time series. Because these models cannot capture external accidental events that could significantly influence the query frequency, the query correlation model is also modified and integrated for predicting the upcoming query trend. Finally, linear regression is used to unify the models. Experiments based on 15,511,531 queries from a commercial search engine query log spanning 283 days validate the effectiveness of our proposed unified algorithm.
Citations: 8
Towards Combining Structured Pattern Mining and Graph Kernels
Pub Date : 2008-12-15 DOI: 10.1109/ICDMW.2008.125
Fabrizio Costa, Björn Bringmann
This paper presents a novel approach to feature construction for structured data in order to enhance graph prediction classification performance. To this end we combine graph mining techniques with graph kernel based classifiers. The main idea is to employ efficient mining techniques to extract a set of patterns correlated with the target concept and use these, or a selected subset of these, to annotate the original graph structures. A decomposition kernel is then defined on the enriched structured data instances. Experimental results on carcinogenic and toxicological activity prediction tasks for small molecules show that the proposed technique significantly increases classification performance.
Citations: 3
Speeding up Array Query Processing by Just-In-Time Compilation
Pub Date : 2008-12-15 DOI: 10.1109/ICDMW.2008.73
C. Jucovschi, P. Baumann, Sorin Stancu-Mara
Interpreted languages frequently suffer from higher processing times as compared to compiled approaches. Typically this happens when complex computations are performed. Array DBMSs, which extend database functionality with multidimensional array modeling and query support, find themselves in exactly this situation: queries often involve a large number of operations, and each such operation is applied to a large number of array elements. In this paper, we propose just-in-time compilation as an optimization method for an interpreted array query language. This is achieved by grouping suitable query nodes into complex operation nodes, for which C code is generated, compiled, and loaded during runtime. We present our approach based on the array DBMS rasdaman, discuss its benefits and its embedding into the rasdaman query evaluation, and show initial, rather promising benchmark results.
Citations: 12
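The grouping step described in the abstract above, where several per-cell operation nodes are fused into one generated function, can be shown by analogy. The paper generates C code and loads it at runtime; the sketch below generates Python source instead and compiles it with the built-in `compile` and `exec`. The tuple-based node format and the names `to_source` and `jit_compile` are hypothetical and are not part of rasdaman.

```python
def to_source(node):
    """Render an expression tree as a Python expression over one cell value x.

    Assumed node shapes: ("cell",), ("const", value), (op, left, right).
    """
    kind = node[0]
    if kind == "cell":
        return "x"
    if kind == "const":
        return repr(float(node[1]))
    op, left, right = node
    if op not in {"+", "-", "*", "/"}:
        raise ValueError(f"unsupported operator: {op}")
    return f"({to_source(left)} {op} {to_source(right)})"


def jit_compile(tree):
    """Fuse the whole tree into one function applied to every array cell."""
    src = (
        "def fused(cells):\n"
        f"    return [{to_source(tree)} for x in cells]\n"
    )
    namespace = {}
    # Compile the generated source once; later calls skip tree interpretation.
    exec(compile(src, "<fused-query>", "exec"), namespace)
    return namespace["fused"]
```

An interpreter would walk the tree once per cell and per node; the fused function evaluates one compiled expression per cell, which is the effect the abstract attributes to complex operation nodes.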
Multiple-Instance Regression with Structured Data
Pub Date : 2008-12-15 DOI: 10.1109/ICDMW.2008.31
K. Wagstaff, T. Lane, A. Roper
We present a multiple-instance regression algorithm that models internal bag structure to identify the items most relevant to the bag labels. Multiple-instance regression (MIR) operates on a set of bags with real-valued labels, each containing a set of unlabeled items, in which the relevance of each item to its bag label is unknown. The goal is to predict the labels of new bags from their contents. Unlike previous MIR methods, MI-ClusterRegress can operate on bags that are structured in that they contain items drawn from a number of distinct (but unknown) distributions. MI-ClusterRegress simultaneously learns a model of the bag's internal structure, the relevance of each item, and a regression model that accurately predicts labels for new bags. We evaluated this approach on the challenging MIR problem of crop yield prediction from remote sensing data. MI-ClusterRegress provided predictions that were more accurate than those obtained with non-multiple-instance approaches or MIR methods that do not model the bag structure.
Citations: 31
Research on Methodology of Classification Mining for Tumor Markers
Pub Date : 2008-12-15 DOI: 10.1109/ICDMW.2008.74
Wei Jiang, Min Yao, Jiekai Yu
Reliability is one of the key issues in data mining. For massive protein mass spectrum data from SELDI-TOF-MS, this paper proposes an effective and reliable method to extract tumor markers. First, an adaptive threshold approach based on wavelet transformation is put forward to eliminate the noise in the raw data and so provide a reliable foundation for tumor marker extraction. Then an SVM-based genetic algorithm is designed to construct a discriminating model, in order to find the optimal combination of distinct protein peaks and obtain tumor markers. Finally, the proposed method is applied to extract tumor markers from protein mass spectrum data from normal mouse serum and induced pancreatic cancer mouse serum, to verify the feasibility and reliability of the method.
Citations: 0
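The wavelet denoising step mentioned in the abstract above can be illustrated with a one-level Haar transform and soft thresholding. The abstract does not say which wavelet or threshold rule the authors use, so the Haar basis, the caller-supplied threshold `t` (rather than an adaptive one) and the function names are all assumptions.

```python
import math


def haar_forward(signal):
    """One-level Haar transform of an even-length sequence."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Invert haar_forward."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out


def soft_threshold(coeffs, t):
    """Shrink every coefficient toward zero by t."""
    return [math.copysign(max(abs(c) - t, 0.0), c) for c in coeffs]


def denoise(signal, t):
    """Suppress small detail coefficients, which are assumed to be noise."""
    approx, detail = haar_forward(signal)
    return haar_inverse(approx, soft_threshold(detail, t))
```

An adaptive scheme would choose `t` from the data, for example from an estimate of the noise level in the detail coefficients, instead of taking it as a fixed argument.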