HILDA '16 latest papers

Towards reliable interactive data cleaning: a user survey and recommendations
Pub Date : 2016-06-26 DOI: 10.1145/2939502.2939511
S. Krishnan, D. Haas, M. Franklin, Eugene Wu
Data cleaning is frequently an iterative process tailored to the requirements of a specific analysis task. The design and implementation of iterative data cleaning tools presents novel challenges, both technical and organizational, to the community. In this paper, we present results from a user survey (N = 29) of data analysts and infrastructure engineers from industry and academia. We highlight three important themes: (1) the iterative nature of data cleaning, (2) the lack of rigor in evaluating the correctness of data cleaning, and (3) the disconnect between the analysts who query the data and the infrastructure engineers who design the cleaning pipelines. We conclude by presenting a number of recommendations for future work in which we envision an interactive data cleaning system that accounts for the observed challenges.
Citations: 51
Towards a general-purpose query language for visualization recommendation
Pub Date : 2016-06-26 DOI: 10.1145/2939502.2939506
Kanit Wongsuphasawat, Dominik Moritz, Anushka Anand, J. Mackinlay, Bill Howe, Jeffrey Heer
Creating effective visualizations requires domain familiarity as well as design and analysis expertise, and may impose a tedious specification process. To address these difficulties, many visualization tools complement manual specification with recommendations. However, designing interfaces, ranking metrics, and scalable recommender systems remain important research challenges. In this paper, we propose a common framework for facilitating the development of visualization recommender systems in the form of a specification language for querying over the space of visualizations. We present the preliminary design of CompassQL, which defines (1) a partial specification that describes enumeration constraints, and (2) methods for choosing, ranking, and grouping recommended visualizations. To demonstrate the expressivity of the language, we describe existing recommender systems in terms of CompassQL queries. Finally, we discuss the prospective benefits of a common language for future visualization recommender systems.
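The idea of a partial specification that is enumerated, ranked, and grouped can be sketched in a few lines of Python. This is only an illustration of the concept: the `"?"` wildcard marker, the field names, the domains, and the toy scoring rule are all assumptions made here, and none of it is CompassQL's actual syntax or API.

```python
from itertools import product

WILDCARD = "?"  # hypothetical marker for an unspecified property

# Hypothetical partial specification: the x field is fixed,
# while the mark type and the y field are left open.
partial_spec = {"mark": WILDCARD, "x": "horsepower", "y": WILDCARD}

# Hypothetical domains that each open property may range over.
domains = {
    "mark": ["point", "bar", "line"],
    "y": ["mpg", "weight"],
}

def enumerate_specs(spec, domains):
    """Expand every wildcard property into all values of its domain."""
    open_keys = [k for k, v in spec.items() if v == WILDCARD]
    for values in product(*(domains[k] for k in open_keys)):
        yield {**spec, **dict(zip(open_keys, values))}

def score(spec):
    """Toy ranking rule (an assumption): points over lines over bars."""
    return {"point": 2, "line": 1, "bar": 0}[spec["mark"]]

# Enumerate, rank, then group the ranked candidates by their y field.
candidates = list(enumerate_specs(partial_spec, domains))
ranked = sorted(candidates, key=score, reverse=True)
grouped = {}
for spec in ranked:
    grouped.setdefault(spec["y"], []).append(spec["mark"])
```

Keeping enumeration, ranking, and grouping as separate steps mirrors the two parts the abstract lists; a real recommender would replace the toy score with its own ranking metric.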
Citations: 96
VisTrees: fast indexes for interactive data exploration
Pub Date : 2016-06-26 DOI: 10.1145/2939502.2939507
Muhammad El-Hindi, Zheguang Zhao, Carsten Binnig, Tim Kraska
Visualizations are arguably the most important tool to explore, understand and convey facts about data. As part of interactive data exploration, visualizations might be used to quickly skim through the data and look for patterns. Unfortunately, database systems are not designed to efficiently support these workloads. As a result, visualizations often take very long to produce, creating a significant barrier to interactive data analysis. In this paper, we focus on the interactive computation of histograms for data exploration. To address this issue, we present a novel multi-dimensional index structure called VisTree. As a key contribution, this paper presents several techniques to better align the design of multi-dimensional indexes with the needs of visualization tools for data exploration. Our experiments show that the VisTree achieves a speed increase of up to three orders of magnitude compared to traditional multi-dimensional indexes and enables an interactive speed of below 500ms even on large data sets.
Citations: 31
Have a chat with Clustine, conversational engine to query large tables
Pub Date : 2016-06-26 DOI: 10.1145/2939502.2939504
Thibault Sellam, M. Kersten
Thanks to recent advances in AI and the stellar popularity of messaging apps (e.g., WhatsApp), chatbots are no longer bound to customer support services and computer museums. Indeed, they offer a mighty, lightweight and accessible way to provide services over the Internet. In this paper, we introduce Clustine, a chatbot to help users query large tables through short messages. The main idea is to combine cluster analysis and text generation to compress query results, describe them with natural language and make recommendations. We present the architecture of our system, demonstrate it with two use cases, and present early validation experiments with 12 real datasets to show that its promises are reachable.
Citations: 6
ModelDB: a system for machine learning model management
Pub Date : 2016-06-26 DOI: 10.1145/2939502.2939516
Manasi Vartak, H. Subramanyam, Wei-En Lee, S. Viswanathan, S. Husnoo, S. Madden, M. Zaharia
Building a machine learning model is an iterative process. A data scientist will build many tens to hundreds of models before arriving at one that meets some acceptance criteria (e.g. AUC cutoff, accuracy threshold). However, the current style of model building is ad-hoc and there is no practical way for a data scientist to manage models that are built over time. As a result, the data scientist must attempt to "remember" previously constructed models and insights obtained from them. This task is challenging for more than a handful of models and can hamper the process of sensemaking. Without a means to manage models, there is no easy way for a data scientist to answer questions such as "Which models were built using an incorrect feature?", "Which model performed best on American customers?" or "How did the two top models compare?" In this paper, we describe our ongoing work on ModelDB, a novel end-to-end system for the management of machine learning models. ModelDB clients automatically track machine learning models in their native environments (e.g. scikit-learn, spark.ml), the ModelDB backend introduces a common layer of abstractions to represent models and pipelines, and the ModelDB frontend allows visual exploration and analyses of models via a web-based interface.
Citations: 197
Data programming with DDLite: putting humans in a different part of the loop
Pub Date : 2016-06-26 DOI: 10.1145/2939502.2939515
Henry R. Ehrenberg, Jaeho Shin, Alexander J. Ratner, Jason Alan Fries, C. Ré
Populating large-scale structured databases from unstructured sources is a critical and challenging task in data analytics. As automated feature engineering methods grow increasingly prevalent, constructing sufficiently large labeled training sets has become the primary hurdle in building machine learning information extraction systems. In light of this, we have taken a new approach called data programming [7]. Rather than hand-labeling data, in the data programming paradigm, users generate large amounts of noisy training labels by programmatically encoding domain heuristics as simple rules. Using this approach over more traditional distant supervision methods and fully supervised approaches using labeled data, we have been able to construct knowledge base systems more rapidly and with higher quality. Since the ability to quickly prototype, evaluate, and debug these rules is a key component of this paradigm, we introduce DDLite, an interactive development framework for data programming. This paper reports feedback collected from DDLite users across a diverse set of entity extraction tasks. We share observations from several DDLite hackathons in which 10 biomedical researchers prototyped information extraction pipelines for chemicals, diseases, and anatomical named entities. Initial results were promising, with the disease tagging team obtaining an F1 score within 10 points of the state-of-the-art in only a single day-long hackathon's work. Our key insights concern the challenges of writing diverse rule sets for generating labels, and exploring training data. These findings motivate several areas of active data programming research.
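To make "programmatically encoding domain heuristics as simple rules" concrete, here is a minimal Python sketch. The three rules, their keyword lists, and the unweighted majority vote are illustrative assumptions; this is not DDLite's API, and the plain vote stands in for whatever label-denoising step the actual system performs.

```python
from collections import Counter

# Label values: 1 = disease mention, -1 = not a disease, 0 = abstain.
POSITIVE, NEGATIVE, ABSTAIN = 1, -1, 0

def lf_suffix(phrase):
    """Hypothetical rule: disease names often end in 'itis' or 'oma'."""
    return POSITIVE if phrase.lower().endswith(("itis", "oma")) else ABSTAIN

def lf_keyword(phrase):
    """Hypothetical rule: phrases containing 'syndrome' or 'disease' are positive."""
    text = phrase.lower()
    return POSITIVE if "syndrome" in text or "disease" in text else ABSTAIN

def lf_stopword(phrase):
    """Hypothetical rule: a few common non-disease words are negative."""
    return NEGATIVE if phrase.lower() in {"patient", "treatment", "study"} else ABSTAIN

LABELING_FUNCTIONS = [lf_suffix, lf_keyword, lf_stopword]

def noisy_label(phrase):
    """Combine rule votes by unweighted majority; ties and all-abstain give ABSTAIN."""
    votes = [lf(phrase) for lf in LABELING_FUNCTIONS]
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts or counts[POSITIVE] == counts[NEGATIVE]:
        return ABSTAIN
    return POSITIVE if counts[POSITIVE] > counts[NEGATIVE] else NEGATIVE

candidates = ["hepatitis", "Crohn's disease", "patient", "aspirin"]
labels = {c: noisy_label(c) for c in candidates}
```

Each rule is cheap to write and individually unreliable (the suffix rule would also fire on "coma"), which is why quickly prototyping, evaluating, and debugging rules matters in this paradigm.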
Citations: 27
Visual exploration of machine learning results using data cube analysis
Pub Date : 2016-06-26 DOI: 10.1145/2939502.2939503
Minsuk Kahng, Dezhi Fang, Duen Horng Chau
As complex machine learning systems become more widely adopted, it becomes increasingly challenging for users to understand models or interpret the results generated from the models. We present our ongoing work on developing interactive and visual approaches for exploring and understanding machine learning results using data cube analysis. We propose MLCube, a data cube inspired framework that enables users to define instance subsets using feature conditions and computes aggregate statistics and evaluation metrics over the subsets. We also design MLCube Explorer, an interactive visualization tool for comparing models' performances over the subsets. Users can interactively specify operations, such as drilling down to specific instance subsets, to perform more in-depth exploration. Through a usage scenario, we demonstrate how MLCube Explorer works with a public advertisement click log data set, to help a user build new advertisement click prediction models that advance over an existing model.
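The core operation described above, defining instance subsets with feature conditions and computing an evaluation metric over each subset, can be sketched as follows. The toy click-log rows, the field names, and the `subset_accuracy` helper are assumptions for illustration only and do not reflect MLCube's actual interface.

```python
def subset_accuracy(rows, condition, prediction_key):
    """Accuracy of one model over the rows that satisfy a feature condition."""
    subset = [r for r in rows if condition(r)]
    if not subset:
        return None  # no instances match the condition
    correct = sum(1 for r in subset if r[prediction_key] == r["label"])
    return correct / len(subset)

# Hypothetical click-log instances with predictions from two models.
rows = [
    {"device": "mobile", "hour": 9, "label": 1, "model_a": 1, "model_b": 0},
    {"device": "mobile", "hour": 21, "label": 0, "model_a": 1, "model_b": 0},
    {"device": "desktop", "hour": 9, "label": 1, "model_a": 1, "model_b": 1},
    {"device": "desktop", "hour": 22, "label": 0, "model_a": 0, "model_b": 0},
]

# Each condition defines a subset; later ones drill down further.
conditions = {
    "all": lambda r: True,
    "device=mobile": lambda r: r["device"] == "mobile",
    "device=mobile AND hour>=18": lambda r: r["device"] == "mobile" and r["hour"] >= 18,
}

report = {
    name: {m: subset_accuracy(rows, cond, m) for m in ("model_a", "model_b")}
    for name, cond in conditions.items()
}
```

In this toy data both models have the same overall accuracy, yet they diverge completely on the evening-mobile subset, which is the kind of difference that drilling down is meant to expose.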
Citations: 69
Interactive online learning for clinical entity recognition
Pub Date : 2016-06-26 DOI: 10.1145/2939502.2939510
L. Tari, Varish Mulwad, Anna von Reden
Named entity recognition and entity linking are core natural language processing components that are predominantly solved by supervised machine learning approaches. Such supervised machine learning approaches require manual annotation of training data that can be expensive to compile. The applicability of supervised, machine learning-based entity recognition and linking components in real-world applications can be hindered by the limited availability of training data. In this paper, we propose a novel approach that uses ontologies as a basis for entity recognition and linking, and captures context of neighboring tokens of the entities of interest with vectors based on syntactic and semantic features. Our approach takes user feedback so that the vector-based model can be continuously updated in an online setting. Here we demonstrate our approach in a healthcare context, using it to recognize body part and imaging modality entities within clinical documents, and map these entities to the right concepts in the RadLex and NCIT medical ontologies. Our current evaluation shows promising results on a small set of clinical documents with a precision and recall of 0.841 and 0.966. The evaluation also demonstrates that our approach is capable of continuous performance improvement with increasing size of examples. We believe that our human-in-the-loop, online learning approach to entity recognition and linking shows promise that it is suitable for real-world applications.
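The abstract reports precision and recall separately. For readers who want a single figure, the standard F1 score (the harmonic mean of the two) can be derived from them; the value below is computed here from the reported numbers and is not a result stated by the authors.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported in the abstract.
f1 = f1_score(0.841, 0.966)
print(round(f1, 3))  # 0.899
```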
Citations: 0
The case for interactive data exploration accelerators (IDEAs)
Pub Date : 2016-06-26 DOI: 10.1145/2939502.2939513
Andrew Crotty, Alex Galakatos, Emanuel Zgraggen, Carsten Binnig, Tim Kraska
Enabling interactive visualization over new datasets at "human speed" is key to democratizing data science and maximizing human productivity. In this work, we first argue why existing analytics infrastructures do not support interactive data exploration and then outline the challenges and opportunities of building a system specifically designed for interactive data exploration. Finally, we present an Interactive Data Exploration Accelerator (IDEA), a new type of system for interactive data exploration that is specifically designed to integrate with existing data management landscapes and allow users to explore their data instantly without expensive data preparation costs.
Citations: 43
PFunk-H: approximate query processing using perceptual models
Pub Date : 2016-06-26 DOI: 10.1145/2939502.2939512
Daniel Alabi, Eugene Wu
Interactive visualization tools (e.g., crossfilter) are critical to many data analysts because they make the discovery and verification of hypotheses quick and seamless. Increasing data sizes have made the scalability of these tools a necessity. To bridge the gap between data sizes and interactivity, many visualization systems have turned to sampling-based approximate query processing frameworks. However, these systems are currently oblivious to human perceptual visual accuracy. This can lead either to overly aggressive sampling when the approximation accuracy is higher than needed, or to an incorrect visual rendering when the accuracy is too lax. Thus, for both correctness and efficiency, we propose to use empirical knowledge of human perceptual limitations to automatically bound the error of approximate answers meant for visualization. This paper explores a preliminary model of sampling-based approximate query processing that uses perceptual models (encoded as functions) to construct approximate answers intended for visualization. We present initial results that show that the approximate and non-approximate answers for a given query differ by a perceptually indiscernible amount, as defined by perceptual functions.
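One way to picture how a perceptual limit could bound approximation error is a sample-size calculation: choose just enough samples that the estimate is, with high probability, closer to the exact answer than a viewer could perceive. The sketch below makes several assumptions that do not come from the abstract: it uses Hoeffding's inequality for the bound, a made-up perceptual function (differences under 2% of the axis range are treated as invisible), and hypothetical function names.

```python
import math
import random

def hoeffding_sample_size(value_range, epsilon, delta):
    """Samples needed so the sample mean is within epsilon of the true mean
    with probability at least 1 - delta, for values spanning value_range."""
    return math.ceil(value_range ** 2 * math.log(2 / delta) / (2 * epsilon ** 2))

def approximate_mean(values, low, high, epsilon, delta, rng):
    """Estimate the mean from a uniform random sample sized by the bound above."""
    n = hoeffding_sample_size(high - low, epsilon, delta)
    if n >= len(values):
        return sum(values) / len(values), len(values)  # sampling would not help
    sample = [rng.choice(values) for _ in range(n)]  # sample with replacement
    return sum(sample) / n, n

def perceptual_epsilon(axis_range):
    """Hypothetical perceptual function: assume a viewer cannot distinguish
    bar heights that differ by less than 2% of the axis range."""
    return 0.02 * axis_range

rng = random.Random(0)
data = [rng.uniform(0, 100) for _ in range(200_000)]
eps = perceptual_epsilon(100)
estimate, used = approximate_mean(data, 0, 100, eps, delta=0.05, rng=rng)
exact = sum(data) / len(data)
```

Under these assumptions the required sample size depends on the value range and the perceptual threshold but not on the number of rows, so a looser threshold directly translates into less work per query.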
Citations: 30