
Proceedings of the ACM SIGKDD Workshop on Interactive Data Exploration and Analytics: Latest Articles

Augmenting MATLAB with semantic objects for an interactive visual environment
C. Lee, J. Choo, Duen Horng Chau, Haesun Park
Analysis tools such as Matlab, R, and SAS support a myriad of built-in computational functions and various standard visualization techniques. However, most of them offer little interaction with visualizations, mainly because the tools treat the data as just numerical vectors or matrices while ignoring any semantic meaning associated with them. To address this limitation, we augment Matlab, one of the most widely used data analysis tools, with the capability of directly handling the underlying semantic objects and their meanings. Such capabilities allow users to flexibly assign essential interaction capabilities, such as brushing-and-linking and details-on-demand interactions, to visualizations. To demonstrate the capabilities, two usage scenarios in document and graph analysis domains are presented.
DOI: https://doi.org/10.1145/2501511.2501521 · Published: 2013-08-11
Citations: 1
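The brushing-and-linking interaction named in the abstract above can be illustrated with a small sketch. This is not the authors' MATLAB implementation; the `LinkedViews` class, its method names, and the document IDs are hypothetical. It only shows the core idea: once each plotted row is tied to a semantic object ID, a selection made in one view can be translated into highlights in every other view.

```python
from typing import Dict, List, Set


class LinkedViews:
    """Hypothetical registry of views that show the same semantic objects."""

    def __init__(self) -> None:
        # view name -> list mapping a row index in that view to an object ID
        self._rows: Dict[str, List[str]] = {}

    def add_view(self, name: str, object_ids: List[str]) -> None:
        """Register a view; object_ids[i] is the object drawn at row i."""
        self._rows[name] = list(object_ids)

    def brush(self, view: str, rows: Set[int]) -> Dict[str, Set[int]]:
        """Select rows in one view; return the rows to highlight in every view."""
        selected = {self._rows[view][r] for r in rows}
        return {
            name: {i for i, obj in enumerate(ids) if obj in selected}
            for name, ids in self._rows.items()
        }


if __name__ == "__main__":
    views = LinkedViews()
    views.add_view("scatter", ["doc1", "doc2", "doc3"])
    views.add_view("table", ["doc3", "doc1", "doc2"])
    # Brushing rows 0 and 2 of the scatter plot selects doc1 and doc3.
    print(views.brush("scatter", {0, 2}))
```

Keeping the row-to-object mapping outside the numeric data is the design point: a plain numeric matrix would not carry enough information to connect row 0 of one plot with row 1 of another.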
Storygraph: extracting patterns from spatio-temporal data
Ayush Shrestha, B. Miller, Ying Zhu, Yi Zhao
Analysis of spatio-temporal data often involves correlating different events in time and location to uncover relationships between them. It is also desirable to identify different patterns in the data. Visualizing time and space in the same chart is not trivial. Common methods include plotting the latitude, longitude and time as three dimensions of a 3D chart. Drawbacks of these 3D charts include poor scalability due to clutter and occlusion, and difficulty tracking time when events are clustered. In this paper we present a novel 2D visualization technique called Storygraph, which provides an integrated view of time and location to address these issues. We also present storylines based on Storygraph, which show the movement of actors over time. Lastly, we present case studies to show the applications of Storygraph.
DOI: https://doi.org/10.1145/2501511.2501525 · Published: 2013-08-11
Citations: 22
Towards anytime active learning: interrupting experts to reduce annotation costs
M. E. Ramirez-Loaiza, A. Culotta, M. Bilgic
Many active learning methods use annotation cost or expert quality as part of their framework to select the best data for annotation. While these methods model expert quality, availability, or expertise, they have no direct influence on any of these elements. We present a novel framework built upon decision-theoretic active learning that allows the learner to directly control label quality by allocating a time budget to each annotation. We show that our method is able to improve performance efficiency of the active learner through an interruption mechanism trading off the induced error with the cost of annotation. Our simulation experiments on three document classification tasks show that some interruption is almost always better than none, but that the optimal interruption time varies by dataset.
DOI: https://doi.org/10.1145/2501511.2501524 · Published: 2013-08-11
Citations: 7
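The trade-off described in the abstract above, between the error induced by interrupting an expert and the cost of the annotation time, can be made concrete with a toy calculation. Everything below is an assumption for illustration: the loss function, the exponential error curve `toy_error`, and the candidate budgets are made up and are not taken from the paper.

```python
import math
from typing import Callable, Sequence


def expected_loss(budget: float, error_at: Callable[[float], float],
                  error_weight: float, cost_per_second: float) -> float:
    """Hypothetical loss: weighted label error plus the price of expert time."""
    return error_weight * error_at(budget) + cost_per_second * budget


def best_budget(budgets: Sequence[float], error_at: Callable[[float], float],
                error_weight: float, cost_per_second: float) -> float:
    """Pick the candidate time budget with the lowest expected loss."""
    return min(
        budgets,
        key=lambda b: expected_loss(b, error_at, error_weight, cost_per_second),
    )


def toy_error(budget: float) -> float:
    """Made-up error curve: a coin flip at 0 s that improves with more time."""
    return 0.5 * math.exp(-budget / 10.0)


if __name__ == "__main__":
    candidates = [0, 5, 10, 20, 40, 80]
    # With a small per-second cost, an intermediate budget wins.
    print(best_budget(candidates, toy_error, error_weight=1.0, cost_per_second=0.01))
```

Under these assumed numbers the best budget is neither the shortest nor the longest one, which mirrors the abstract's observation that some interruption tends to beat none while the optimum depends on the data.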
Lytic: synthesizing high-dimensional algorithmic analysis with domain-agnostic, faceted visual analytics
Edward Clarkson, J. Choo, John Turgeson, R. Decuir, Haesun Park
We present Lytic, a domain-independent, faceted visual analytic (VA) system for interactive exploration of large datasets. It combines a flexible UI that adapts to arbitrary character-separated value (CSV) datasets with algorithmic preprocessing to compute unsupervised dimension reduction and cluster data from high-dimensional fields. It provides a variety of visualization options that require minimal user effort to configure and a consistent user experience between visualization types and underlying datasets. Filtering, comparison and visualization operations work in concert, allowing users to hop seamlessly between actions and pursue answers to expected and unexpected data hypotheses.
DOI: https://doi.org/10.1145/2501511.2501518 · Published: 2013-08-11
Citations: 2
Zips: mining compressing sequential patterns in streams
Hoang Thanh Lam, T. Calders, Jie Yang, F. Mörchen, Dmitriy Fradkin
We propose a streaming algorithm, based on the minimal description length (MDL) principle, for extracting non-redundant sequential patterns. For static databases, the MDL-based approach, which selects patterns based on their capacity to compress data rather than their frequency, was shown to be remarkably effective for extracting meaningful patterns and solving the redundancy issue in frequent itemset and sequence mining. The existing MDL-based algorithms, however, either start from a seed set of frequent patterns or require multiple passes through the data. As such, the existing approaches scale poorly and are unsuitable for large datasets. Therefore, our main contribution is the proposal of a new streaming algorithm, called Zips, that does not require a seed set of patterns and requires only one scan over the data. For Zips, we extended the Lempel-Ziv (LZ) compression algorithm in three ways. First, whereas LZ assigns codes uniformly as it builds up its dictionary while scanning the input, Zips assigns codewords according to the usage of the dictionary words; more heavily used words get shorter code lengths. Second, Zips also exploits non-consecutive occurrences of dictionary words for compression. Third, the well-known space-saving algorithm is used to evict unpromising words from the dictionary. Experiments on one synthetic and two real-world large-scale datasets show that our approach extracts meaningful compressing patterns with similar quality to the state-of-the-art multi-pass algorithms proposed for static databases of sequences. Moreover, our approach scales linearly with the size of data streams while all the existing algorithms do not.
DOI: https://doi.org/10.1145/2501511.2501520 · Published: 2013-08-11
Citations: 5
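The Zips abstract above mentions the space-saving algorithm as its eviction mechanism. The sketch below shows the generic space-saving counter scheme on a plain stream of items; how Zips applies it to dictionary words is not described in the abstract, so the function name, the `capacity` parameter, and the sample stream are assumptions for illustration only.

```python
from typing import Dict, Hashable, Iterable


def space_saving(stream: Iterable[Hashable], capacity: int) -> Dict[Hashable, int]:
    """Approximate item counts using at most `capacity` counters."""
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    counters: Dict[Hashable, int] = {}
    for item in stream:
        if item in counters:
            # Already monitored: just count the occurrence.
            counters[item] += 1
        elif len(counters) < capacity:
            # Free slot available: start monitoring the item.
            counters[item] = 1
        else:
            # Evict the item with the smallest counter; the newcomer
            # inherits that count plus one, so it may be overestimated.
            victim = min(counters, key=counters.get)
            counters[item] = counters.pop(victim) + 1
    return counters


if __name__ == "__main__":
    words = ["ab", "ab", "cd", "ab", "ef", "ab", "cd", "gh"]
    print(space_saving(words, capacity=2))
```

Memory stays bounded by `capacity` regardless of stream length, and heavily used items tend to keep their slots, which is why this kind of counter fits a single-pass setting.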
Building blocks for exploratory data analysis tools
S. Alspaugh, Marti A. Hearst, A. Ganapathi, R. Katz
Data exploration is largely manual and labor intensive. Although there are various tools and statistical techniques that can be applied to data sets, there is little help to identify what questions to ask of a data set, let alone what domain knowledge is useful in answering the questions. In this paper, we study user queries against production data sets in Splunk. Specifically, we characterize the interplay between data sets and the operations used to analyze them using latent semantic analysis, and discuss how this characterization serves as a building block for a data analysis recommendation system. This is a work-in-progress paper.
DOI: https://doi.org/10.1145/2501511.2501515 · Published: 2013-08-11
Citations: 7
Proceedings of the ACM SIGKDD Workshop on Interactive Data Exploration and Analytics
Duen Horng Chau, Jilles Vreeken, M. Leeuwen, C. Faloutsos
We have entered the era of big data. Massive datasets, surpassing terabytes and petabytes in size, are now commonplace. They arise in numerous settings in science, government, and enterprises, and technology exists by which we can collect and store such massive amounts of information. Yet, making sense of these data remains a fundamental challenge. We lack the means to exploratively analyze databases of this scale. Currently, few technologies allow us to freely "wander" around the data and make discoveries by following our intuition or serendipity. While standard data mining aims at finding highly interesting results, it is typically computationally demanding and time consuming, and thus may not be well suited for interactive exploration of large datasets. Interactive data mining techniques that aptly integrate human intuition, by means of visualization and intuitive human-computer interaction techniques, with machine computation support have been shown to help people gain significant insights into a wide range of problems. However, as datasets are being generated in larger volumes, at higher velocity, and in greater variety, creating effective interactive data mining techniques becomes a much harder task.
DOI: https://doi.org/10.1145/2501511 · Published: 2013-08-11
Citations: 4
One click mining: interactive local pattern discovery through implicit preference and performance learning
Mario Boley, M. Mampaey, Bo Kang, P. Tokmakov, S. Wrobel
It is known that productive pattern discovery from data has to interactively involve the user as directly as possible. State-of-the-art toolboxes require the specification of sophisticated workflows with an explicit selection of a data mining method, all its required parameters, and a corresponding algorithm. This hinders the desired rapid interaction, especially with users who are experts in the data domain rather than data mining experts. In this paper, we present a fundamentally new approach towards user involvement that relies exclusively on the implicit feedback available from the natural analysis behavior of the user, and at the same time allows the user to work with a multitude of pattern classes and discovery algorithms simultaneously without even knowing the details of each algorithm. To achieve this goal, we rely on a recently proposed co-active learning model and a special feature representation of patterns to arrive at an adaptively tuned user interestingness model. At the same time, we propose an adaptive time-allocation strategy to distribute computation time among a set of underlying mining algorithms. We describe the technical details of our approach, present the user interface for gathering implicit feedback, and provide preliminary evaluation results.
DOI: https://doi.org/10.1145/2501511.2501517 · Published: 2013-08-11
Citations: 67
Randomly sampling maximal itemsets
Sandy Moens, Bart Goethals
Pattern mining techniques generally enumerate lots of uninteresting and redundant patterns. To obtain less redundant collections, techniques exist that give condensed representations of these collections. However, the proposed techniques often rely on complete enumeration of the pattern space, which can be prohibitive in terms of time and memory. Sampling can be used to filter the output space of patterns without explicit enumeration. We propose a framework for random sampling of maximal itemsets from transactional databases. The presented framework can use any monotonically decreasing measure as its interestingness criterion for this purpose. Moreover, we use an approximation measure to guide the search for maximal sets to different parts of the output space. We show in our experiments that the method can rapidly generate small collections of patterns with good quality. The sampling framework has been implemented in the interactive visual data mining tool called MIME, thereby enabling users to quickly sample a collection of patterns and analyze the results.
DOI: https://doi.org/10.1145/2501511.2501523 · Published: 2013-08-11
Citations: 23
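Sampling a maximal itemset without enumerating the pattern space, as the abstract above describes, can be sketched as a random walk that keeps adding items while a monotonically decreasing measure stays above a threshold. The sketch below uses support as that measure and picks each extension uniformly at random; both choices, along with the function names and the sample transactions, are assumptions for illustration and not the paper's actual sampling scheme.

```python
import random
from typing import FrozenSet, Optional, Sequence, Set


def support(itemset: Set[str], transactions: Sequence[Set[str]]) -> int:
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def sample_maximal_itemset(transactions: Sequence[Set[str]], min_support: int,
                           rng: Optional[random.Random] = None) -> FrozenSet[str]:
    """Grow an itemset by random extensions until none keeps it frequent."""
    rng = rng or random.Random()
    items = sorted(set().union(*transactions))
    current: Set[str] = set()
    while True:
        # Support never increases when an item is added, so an itemset
        # with no frequent one-item extension is maximal.
        candidates = [
            i for i in items
            if i not in current
            and support(current | {i}, transactions) >= min_support
        ]
        if not candidates:
            return frozenset(current)
        current.add(rng.choice(candidates))


if __name__ == "__main__":
    baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
    print(sorted(sample_maximal_itemset(baskets, min_support=3, rng=random.Random(0))))
```

Each call touches only the itemsets along one path of the lattice, so the cost grows with the length of the sampled pattern rather than with the number of frequent patterns. Uniform extension does not sample maximal itemsets uniformly; weighting the choice of extension is one way such a walk could be steered.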
Methods for exploring and mining tables on Wikipedia
Chandra Bhagavatula, Thanapon Noraset, Doug Downey
Knowledge bases extracted automatically from the Web present new opportunities for data mining and exploration. Given a large, heterogeneous set of extracted relations, new tools are needed for searching the knowledge and uncovering relationships of interest. We present WikiTables, a Web application that enables users to interactively explore tabular knowledge extracted from Wikipedia. In experiments, we show that WikiTables substantially outperforms baselines on the novel task of automatically joining together disparate tables to uncover "interesting" relationships between table columns. We find that a "Semantic Relatedness" measure that leverages the Wikipedia link structure accounts for a majority of this improvement. Further, on the task of keyword search for tables, we show that WikiTables performs comparably to Google Fusion Tables despite using an order of magnitude fewer tables. Our work also includes the release of a number of public resources, including over 15 million tuples of extracted tabular data, manually annotated evaluation sets, and public APIs.
DOI: https://doi.org/10.1145/2501511.2501516 · Published: 2013-08-11
Citations: 92