
Proceedings of the Second Workshop on Data Management for End-to-End Machine Learning (2nd : 2018 : Houston, Tex.): Latest Publications

Modelling Machine Learning Algorithms on Relational Data with Datalog
Nantia Makrynioti, N. Vasiloglou, E. Pasalic, V. Vassalos
The standard process in data science tasks is to prepare features inside a database, export them as a denormalized data frame, and then apply machine learning algorithms. This process is suboptimal for two reasons. First, it requires denormalizing the database, which can turn a small-data problem into a big-data problem. Second, it assumes that the machine learning algorithm is disentangled from the relational model of the problem, a serious limitation since the relational model contains very valuable domain expertise. In this paper we explore the use of convex optimization, and specifically linear programming, for modelling machine learning algorithms on relational data in a way that is integrated with data processing operators. We use SolverBlox, a framework that accepts Datalog code as input and feeds it into a linear programming solver. We demonstrate how common machine learning algorithms can be expressed in this framework and present use-case scenarios where combining data processing with the modelling of optimization problems inside a database offers significant advantages.
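SolverBlox and its Datalog front end are not shown here; as a hedged illustration of the underlying idea (declaring an optimization problem directly over relational-style facts and handing it to an LP solver), the sketch below builds a toy min-cost transportation linear program from dict-encoded "relations" and solves it with SciPy. All relation names and numbers are invented for illustration.

```python
# Hedged sketch: compile relational-style facts into an LP and solve it.
from scipy.optimize import linprog

# "Relations" as plain Python maps: supply(node, capacity),
# demand(node, amount), cost(src, dst, unit_cost) -- illustrative data.
supply = {"A": 10, "B": 15}
demand = {"X": 8, "Y": 12}
cost = {("A", "X"): 2, ("A", "Y"): 4, ("B", "X"): 3, ("B", "Y"): 1}

edges = sorted(cost)  # one decision variable per (src, dst) pair

# Objective: minimize total shipping cost.
c = [cost[e] for e in edges]

# Capacity constraints (<=), one row per supplier.
A_ub = [[1 if e[0] == s else 0 for e in edges] for s in supply]
b_ub = [supply[s] for s in supply]

# Demand constraints (=), one row per consumer.
A_eq = [[1 if e[1] == d else 0 for e in edges] for d in demand]
b_eq = [demand[d] for d in demand]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, method="highs")
# res.x gives shipment quantities per edge; res.fun the optimal cost.
```

The point of the paper's approach is that the constraint matrices above are derived mechanically from the relations, so the modelling stays inside the database rather than in a hand-exported data frame.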
DOI: 10.1145/3209889.3209893 · Published 2018-06-15 · Citations: 7
End-to-End Machine Learning with Apache AsterixDB
Wail Y. Alkowaileet, Sattam Alsubaiee, M. Carey, Chen Li, H. Ramampiaro, Phanwadee Sinthong, Xikui Wang
Recent developments in machine learning and data science provide a foundation for extracting underlying information from Big Data. Unfortunately, current platforms and tools often require data scientists to glue together and maintain custom-built platforms consisting of multiple Big Data component technologies. In this paper, we explain how Apache AsterixDB, an open source Big Data Management System, can help to reduce the burden involved in using machine learning algorithms in Big Data analytics. In particular, we describe how AsterixDB's built-in support for user-defined functions (UDFs), the availability of UDFs in data ingestion pipelines and queries, and the provision of machine learning platform and notebook inter-operation capabilities can together enable data analysts to more easily create and manage end-to-end analytical dataflows.
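AsterixDB's actual UDF and feed APIs are not reproduced here; as a rough, framework-free sketch of the pattern the abstract describes, the snippet below attaches a toy classification UDF to an ingestion loop so that every incoming record is enriched before storage. Function and field names are all illustrative.

```python
# Hedged sketch of a UDF applied inside a data-ingestion pipeline.
def sentiment_udf(record):
    """Toy UDF: tag a record with a naive keyword-based sentiment."""
    text = record.get("text", "").lower()
    if any(w in text for w in ("great", "good", "love")):
        record["sentiment"] = "positive"
    elif any(w in text for w in ("bad", "awful", "hate")):
        record["sentiment"] = "negative"
    else:
        record["sentiment"] = "neutral"
    return record

def ingest(feed, udfs, store):
    """Apply each UDF in order to every record, then persist it."""
    for record in feed:
        for udf in udfs:
            record = udf(record)
        store.append(record)

store = []
ingest([{"text": "I love this"}, {"text": "awful service"}],
       [sentiment_udf], store)
```

In AsterixDB the equivalent UDF would be registered with the system and invoked declaratively inside a feed or query, rather than called from a Python loop.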
DOI: 10.1145/3209889.3209894 · Published 2018-06-15 · Citations: 11
Proceedings of the Second Workshop on Data Management for End-To-End Machine Learning
DOI: 10.1145/3209889 · Published 2018-06-15 · Citations: 3
Avatar: Large Scale Entity Resolution of Heterogeneous User Profiles
Janani Balaji, Chris Min, F. Javed, Yun Zhu
Entity Resolution (ER), also known as record linkage or de-duplication, is a long-standing problem in the data management space. Though an ER system follows an established pipeline of Blocking -> Matching -> Clustering components, the Matching stage forms the core of the system. At CareerBuilder, we de-duplicate massive datasets of people profiles collected from disparate sources with varying informational content. In this paper, we discuss the challenges of de-duplicating inherently heterogeneous data and illustrate the end-to-end process of building a functional and scalable machine learning-based matching platform. We also provide an incremental framework enabling differential ER assimilation for continuous de-duplication workflows.
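The Blocking -> Matching -> Clustering pipeline named in the abstract can be sketched minimally as follows. The blocking key, the Jaccard threshold, and the union-find clustering are illustrative choices for a toy example, not CareerBuilder's production logic.

```python
# Hedged sketch of a three-stage ER pipeline: block, match, cluster.
from itertools import combinations

profiles = [
    {"id": 0, "name": "Jane Q. Doe", "email": "jdoe@example.com"},
    {"id": 1, "name": "Jane Doe",    "email": "jdoe@example.com"},
    {"id": 2, "name": "John Smith",  "email": "jsmith@example.com"},
]

# Blocking: only compare profiles sharing a cheap key.
def block_key(p):
    return (p["email"].split("@")[1], p["name"][0].lower())

blocks = {}
for p in profiles:
    blocks.setdefault(block_key(p), []).append(p)

# Matching: pairwise similarity inside each block (token Jaccard on names).
def match(p, q, threshold=0.5):
    a, b = set(p["name"].lower().split()), set(q["name"].lower().split())
    jaccard = len(a & b) / len(a | b)
    return p["email"] == q["email"] or jaccard >= threshold

# Clustering: union-find over matched pairs.
parent = {p["id"]: p["id"] for p in profiles}
def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

for block in blocks.values():
    for p, q in combinations(block, 2):
        if match(p, q):
            parent[find(p["id"])] = find(q["id"])

clusters = {}
for p in profiles:
    clusters.setdefault(find(p["id"]), []).append(p["id"])
```

Blocking keeps the pairwise matching stage from being quadratic in the full dataset, which is the usual reason it precedes the (expensive, here trivial) matcher.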
DOI: 10.1145/3209889.3209892 · Published 2018-06-15 · Citations: 1
Towards Interactive Curation & Automatic Tuning of ML Pipelines
Carsten Binnig, Benedetto Buratti, Yeounoh Chung, Cyrus Cousins, Tim Kraska, Zeyuan Shang, E. Upfal, R. Zeleznik, Emanuel Zgraggen
Democratizing Data Science requires a fundamental rethinking of the way data analytics and model discovery are done. Available tools for analyzing massive data sets and curating machine learning models are limited in a number of fundamental ways. First, existing tools require well-trained data scientists to select the appropriate techniques to build models and to evaluate their outcomes. Second, existing tools require heavy data preparation steps and are often too slow to give interactive feedback to domain experts in the model building process, severely limiting the possible interactions. Third, current tools do not provide adequate analysis of statistical risk factors in model development. In this work, we present the first iteration of QuIC-M (pronounced quick-m), an interactive human-in-the-loop data exploration and model building suite. The goal is to enable domain experts to build machine learning pipelines an order of magnitude faster than machine learning experts can, while producing models of quality comparable to expert solutions.
DOI: 10.1145/3209889.3209891 · Published 2018-06-15 · Citations: 19
Learning Efficiently Over Heterogeneous Databases: Sampling and Constraints to the Rescue
Jose Picado, Arash Termehchy, Sudhanshu Pathak
Given a relational database and training examples for a target relation, relational learning algorithms learn a definition for the target relation in terms of the existing relations in the database. We propose a relational learning system called CastorX, which learns efficiently across multiple heterogeneous databases. The user specifies connections and relationships between different databases using a set of declarative constraints called matching dependencies (MDs). Each MD connects tuples across multiple databases that are related and can meaningfully join, even though the values of their join attributes may not be equal due to different representations of these values in different databases. CastorX leverages these constraints during learning to find the information relevant to the training data and target definition across multiple databases. Since, under an MD, each tuple in one database may connect to a great many tuples in other databases, the learning process can become very slow. Hence, CastorX uses sampling techniques to learn efficiently and output accurate definitions.
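A minimal sketch of the matching-dependency idea, assuming a toy normalization as the MD comparison: a single tuple can match many counterparts in another database, and per-tuple sampling caps that fan-out. CastorX's actual learning mechanics are far more involved; the data and normalization below are invented.

```python
# Hedged sketch: MD-style approximate join across databases, with
# per-tuple sampling to bound the fan-out.
import random

db1 = [("Acme Corp", 2015), ("Beta LLC", 2018)]
db2 = [("acme corp.", "NY"), ("ACME CORP", "CA"), ("Acme corp", "TX"),
       ("beta llc", "WA")]

def md_similar(a, b):
    """MD comparison: equal after normalizing case and punctuation."""
    norm = lambda s: "".join(
        ch for ch in s.lower() if ch.isalnum() or ch == " ").strip()
    return norm(a) == norm(b)

def sampled_join(left, right, k, seed=0):
    """Join under the MD, keeping at most k matches per left tuple."""
    rng = random.Random(seed)
    out = []
    for name, year in left:
        matches = [r for r in right if md_similar(name, r[0])]
        out.extend((name, year, r)
                   for r in rng.sample(matches, min(k, len(matches))))
    return out

joined = sampled_join(db1, db2, k=2)
# "Acme Corp" matches 3 tuples but contributes only 2 sampled rows.
```

Without the `min(k, ...)` cap, a popular entity would multiply the training data for every rule that touches it, which is the blow-up the paper's sampling addresses.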
DOI: 10.1145/3209889.3209899 · Published 2018-06-15 · Citations: 0
Snorkel MeTaL: Weak Supervision for Multi-Task Learning
Alex Ratner, Braden Hancock, Jared Dunnmon, Roger Goldman, Christopher Ré

Many real-world machine learning problems are challenging to tackle for two reasons: (i) they involve multiple sub-tasks at different levels of granularity; and (ii) they require large volumes of labeled training data. We propose Snorkel MeTaL, an end-to-end system for multi-task learning that leverages weak supervision provided at multiple levels of granularity by domain expert users. In MeTaL, a user specifies a problem consisting of multiple, hierarchically related sub-tasks (for example, classifying a document at multiple levels of granularity) and then provides labeling functions for each sub-task as weak supervision. MeTaL learns a re-weighted model of these labeling functions and uses the combined signal to train a hierarchical multi-task network that is automatically compiled from the structure of the sub-tasks. Using MeTaL on a radiology report triage task and a fine-grained news classification task, we achieve average gains of 11.2 accuracy points over a baseline supervised approach and 9.5 accuracy points over the predictions of the user-provided labeling functions.

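MeTaL learns its labeling-function weights from their agreement structure without ground truth; the simplified stand-in below assumes the weights are already known and just takes a weighted vote, with 0 meaning "abstain". The labeling functions and weights are invented for illustration.

```python
# Hedged, simplified stand-in for a weak-supervision label model:
# combine noisy labeling functions by a (here fixed) weighted vote.
def weighted_vote(lf_outputs, weights):
    """lf_outputs: labels in {-1, 0, +1}, where 0 means abstain."""
    score = sum(w * y for w, y in zip(weights, lf_outputs) if y != 0)
    return 1 if score > 0 else -1 if score < 0 else 0

# Three labeling functions of varying assumed reliability:
lfs = [
    lambda doc: 1 if "urgent" in doc else 0,      # high-precision keyword
    lambda doc: -1 if len(doc) < 20 else 0,       # weak length heuristic
    lambda doc: 1 if "follow-up" in doc else -1,  # noisy catch-all
]
weights = [2.0, 0.5, 1.0]

doc = "urgent: follow-up required"
votes = [lf(doc) for lf in lfs]
label = weighted_vote(votes, weights)
```

The training label produced this way (or, in MeTaL, a probabilistic version of it) is what supervises the downstream multi-task network in place of hand-annotated data.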
DOI: 10.1145/3209889.3209898 · Published 2018-06-01 · Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6436830/pdf/nihms-993812.pdf · Citations: 0
Exploring the Utility of Developer Exhaust
Jian Zhang, Max Lam, Stephanie Wang, Paroma Varma, Luigi Nardi, Kunle Olukotun, Christopher Ré

Using machine learning to analyze data often produces developer exhaust: code, logs, or metadata that do not define the learning algorithm but are byproducts of the data analytics pipeline. We study how the rich information present in developer exhaust can be used to approximately solve otherwise complex tasks. Specifically, we focus on using log data associated with training deep learning models to perform model search by predicting performance metrics for untrained models. Instead of designing a different model for each performance metric, we present two preliminary methods that rely only on information present in logs to predict these characteristics for different architectures. We introduce (i) a nearest neighbor approach with a hand-crafted edit distance metric to compare model architectures and (ii) a more generalizable, end-to-end approach that trains an LSTM on model architectures and associated logs to predict performance metrics of interest. We perform model search optimizing for best validation accuracy, degree of overfitting, and best validation accuracy given a constraint on training time. Our approaches can predict validation accuracy within 1.37% error on average, while the baseline achieves 4.13% by using the performance of a trained model with the closest number of layers. When choosing the best-performing model given constraints on training time, our approaches select top-3 models that overlap with the true top-3 models 82% of the time, while the baseline achieves this only 54% of the time. Our preliminary experiments hold promise for how developer exhaust can help learn models that approximate various complex tasks efficiently.

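The first method (nearest-neighbor prediction under an edit distance over architectures) can be sketched as follows. The paper's hand-crafted metric is richer than plain Levenshtein, and the logged architectures and accuracies below are made up for illustration.

```python
# Hedged sketch: predict an untrained model's accuracy from the logged
# accuracy of its nearest neighbor under layer-sequence edit distance.
def edit_distance(a, b):
    """Levenshtein distance between two layer sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute
        prev = cur
    return prev[-1]

# "Developer exhaust": (architecture, measured validation accuracy)
# pairs recovered from past training logs (values invented).
logged = {
    ("conv", "conv", "pool", "fc"): 0.91,
    ("conv", "pool", "fc"): 0.87,
    ("fc", "fc"): 0.74,
}

def predict_accuracy(arch):
    nearest = min(logged, key=lambda known: edit_distance(arch, known))
    return logged[nearest]

pred = predict_accuracy(("conv", "conv", "pool", "fc", "fc"))
```

Because the prediction costs only a few string comparisons, thousands of candidate architectures can be ranked without training any of them, which is the model-search use the abstract describes.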
DOI: 10.1145/3209889.3209895 · Published 2018-06-01 · Citations: 1
Accelerating Human-in-the-loop Machine Learning: Challenges and Opportunities
Doris Xin, Litian Ma, Jialin Liu, Stephen Macke, Shuchen Song, Aditya G. Parameswaran
Development of machine learning (ML) workflows is a tedious process of iterative experimentation: developers repeatedly make changes to workflows until the desired accuracy is attained. We describe our vision for a "human-in-the-loop" ML system that accelerates this process: by intelligently tracking changes and intermediate results over time, such a system can enable rapid iteration, quick responsive feedback, introspection and debugging, and background execution and automation. Finally, we describe Helix, our preliminary attempt at such a system, which has already achieved speedups of up to 10x on typical iterative workflows against competing systems.
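The intermediate-result tracking behind such speedups can be roughly sketched as content-addressed caching of workflow steps: each step's output is keyed by its name, code version, and inputs, so unchanged prefixes of the workflow are not recomputed across iterations. Helix's actual materialization strategy is more sophisticated; the step names and versioning scheme below are illustrative.

```python
# Hedged sketch: memoize workflow steps across developer iterations.
import hashlib
import json

cache, recomputed = {}, []

def step(name, version, fn, *inputs):
    """Run fn(*inputs) unless an identical (name, version, inputs)
    invocation was already cached."""
    key = hashlib.sha256(
        json.dumps([name, version, inputs], sort_keys=True).encode()
    ).hexdigest()
    if key not in cache:
        recomputed.append(name)  # track which steps actually ran
        cache[key] = fn(*inputs)
    return cache[key]

# Iteration 1: run the full workflow.
raw = step("load", 1, lambda: [1, 2, 3, 4])
feats = step("featurize", 1, lambda xs: [x * x for x in xs], raw)

# Iteration 2: the developer changes only the model step; "featurize"
# hits the cache instead of rerunning.
feats = step("featurize", 1, lambda xs: [x * x for x in xs], raw)
preds = step("model", 2, lambda xs: [x > 4 for x in xs], feats)
```

Bumping a step's `version` (standing in for a real system's code-change detection) is what forces recomputation of that step and everything downstream of it.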
DOI: 10.1145/3209889.3209897 · Published 2018-04-16 · Citations: 83
Learning State Representations for Query Optimization with Deep Reinforcement Learning
Jennifer Ortiz, M. Balazinska, J. Gehrke, S. Keerthi
We explore the idea of using deep reinforcement learning for query optimization. The approach is to build queries incrementally by encoding properties of subqueries using a learned representation. In this paper, we focus specifically on the state representation problem and the formation of the state transition function. We show preliminary results and discuss how we can use the state representation to improve query optimization using reinforcement learning.
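A toy rendering of the incremental formulation, with the paper's learned state representation and value function replaced by a hand-filled cost table (relations and costs are invented): a state is the set of relations joined so far, and each step extends the current subquery by the relation whose successor state scores cheapest.

```python
# Hedged sketch: greedy incremental query building over a value function
# that, in the paper, would be learned with deep reinforcement learning.
est_cost = {  # illustrative estimated costs of intermediate results
    frozenset({"R"}): 100, frozenset({"S"}): 50, frozenset({"T"}): 10,
    frozenset({"S", "T"}): 20, frozenset({"R", "T"}): 400,
    frozenset({"R", "S"}): 500, frozenset({"R", "S", "T"}): 60,
}

def greedy_join_order(relations):
    state, order, total = frozenset(), [], 0
    while state != frozenset(relations):
        # State transition: extend the subquery by one relation,
        # choosing the successor the value function scores cheapest.
        candidates = [state | {r} for r in relations if r not in state]
        nxt = min(candidates, key=lambda s: est_cost.get(s, float("inf")))
        order.append(next(iter(nxt - state)))
        total += est_cost[nxt]
        state = nxt
    return order, total

order, total = greedy_join_order(["R", "S", "T"])
```

Replacing the table lookup with a learned network over encoded subquery properties turns this greedy loop into the reinforcement-learning setup the paper studies.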
DOI: 10.1145/3209889.3209890 · Published 2018-03-22 · Citations: 140