
Proceedings of the Sixth Workshop on Data Management for End-To-End Machine Learning: Latest Publications

dcbench
Sabri Eyuboglu, Bojan Karlas, Christopher Ré, Ce Zhang, James Zou
The development workflow for today's AI applications has grown far beyond the standard model training task. This workflow typically consists of various data and model management tasks. It includes a "data cycle" aimed at producing high-quality training data, and a "model cycle" aimed at managing trained models on their way to production. This broadened workflow has opened a space for already emerging tools and systems for AI development. However, as a research community, we are still missing standardized ways to evaluate these tools and systems. In a humble effort to get this wheel turning, we developed dcbench, a benchmark for evaluating systems for data-centric AI development. In this report, we present the main ideas behind dcbench, some benchmark tasks that we included in the initial release, and a short summary of its implementation.
DOI: 10.1145/3533028.3533310 | Published: 2022-06-12
Citations: 12
LLVM code optimisation for automatic differentiation: when forward and reverse mode lead in the same direction
Maximilian E. Schüle, M. Springer, A. Kemper
Both forward and reverse mode automatic differentiation derive a model function as used for gradient descent automatically. Reverse mode calculates all derivatives in one run, whereas forward mode requires rerunning the algorithm with respect to every variable for which the derivative is needed. To allow for in-database machine learning, we have integrated automatic differentiation as an SQL operator inside the Umbra database system. To benchmark code-generation to GPU, we implement forward as well as reverse mode automatic differentiation. The inspection of the optimised LLVM code shows that nearly the same machine code is executed after the generated LLVM code has been optimised. Thus, both modes yield similar runtimes but different compilation times.
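The two modes contrasted in the abstract can be illustrated outside any database: forward mode seeds one input direction per pass (so the whole computation reruns once per variable), while reverse mode records local partials and propagates adjoints back in a single sweep. A minimal Python sketch with toy classes, not Umbra's SQL operator:

```python
class Dual:
    """Forward-mode AD value: carries f(x) and df/dx for one seeded variable."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)  # product rule
    __rmul__ = __mul__
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__

def forward_grad(f, x):
    """One full pass per input variable: seed the i-th dot to 1."""
    return [f([Dual(v, 1.0 if i == j else 0.0) for j, v in enumerate(x)]).dot
            for i in range(len(x))]

class Var:
    """Reverse-mode AD node: remembers its parents and local partials."""
    def __init__(self, val, parents=()):
        self.val, self.parents, self.grad = val, parents, 0.0
    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.val * other.val, [(self, other.val), (other, self.val)])
    __rmul__ = __mul__
    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.val + other.val, [(self, 1.0), (other, 1.0)])
    __radd__ = __add__

def reverse_grad(f, x):
    """One backward sweep accumulates all partial derivatives at once."""
    inputs = [Var(v) for v in x]
    out = f(inputs)
    stack = [(out, 1.0)]
    while stack:
        node, adjoint = stack.pop()
        node.grad += adjoint
        for parent, local in node.parents:
            stack.append((parent, adjoint * local))
    return [v.grad for v in inputs]

# f(x, y) = x*y + x  ->  df/dx = y + 1, df/dy = x
f = lambda v: v[0] * v[1] + v[0]
print(forward_grad(f, [3.0, 4.0]))  # [5.0, 3.0]
print(reverse_grad(f, [3.0, 4.0]))  # [5.0, 3.0]
```

Both modes compute identical gradients; forward mode simply pays one traversal per input, which is the compilation/runtime trade-off the paper measures at the LLVM level.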
DOI: 10.1145/3533028.3533302 | Published: 2022-06-12
Citations: 1
Towards data-centric what-if analysis for native machine learning pipelines
Stefan Grafberger, Paul Groth, Sebastian Schelter
An important task of data scientists is to understand the sensitivity of their models to changes in the data that the models are trained and tested upon. Currently, conducting such data-centric what-if analyses requires significant and costly manual development and testing with the corresponding chance for the introduction of bugs. We discuss the problem of data-centric what-if analysis over whole ML pipelines (including data preparation and feature encoding), propose optimisations that reuse trained models and intermediate data to reduce the runtime of such analysis, and finally conduct preliminary experiments on three complex example pipelines, where our approach reduces the runtime by a factor of up to six.
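The reuse optimisation described here can be sketched in a few lines: compute the expensive pipeline prefix (data preparation and feature encoding) once, then evaluate every what-if variant against the cached intermediate. All names below are illustrative toys, not the paper's API:

```python
calls = {"encode": 0}  # counts how often the expensive step actually runs

def expensive_encode(xs):
    """Stand-in for data preparation + feature encoding."""
    calls["encode"] += 1
    return tuple(x * 2.0 for x in xs)

def train(features, labels):
    """Toy 'model': predict the majority label."""
    return max(set(labels), key=labels.count)

def score(model, labels):
    return sum(1 for y in labels if y == model) / len(labels)

def what_if(xs, label_variants):
    features = expensive_encode(xs)  # computed once, shared by all variants
    return [score(train(features, ys), ys) for ys in label_variants]

accuracies = what_if([1.0, 2.0, 3.0], [[0, 1, 1], [1, 1, 1], [0, 0, 1]])
print(calls["encode"])  # the encoding step ran once across three what-if runs
```

A naive implementation would rerun the whole pipeline per variant; caching the prefix is what yields the up-to-6x speedups the authors report.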
DOI: 10.1145/3533028.3533303 | Published: 2022-06-12
Citations: 4
Learning-to-learn efficiently with self-learning
Shruti Kunde, Sharod Roy Choudhury, Amey Pandit, Rekha Singhal
Digital Twins of industrial process plants enable various what-if and if-what scenarios of the plants' functioning for fault diagnosis and general monitoring in the real-world. They do so through machine learning (ML) models built using data from sensors fitted in the plant. Over time, environmental factors cause variations in sensor readings, adversely affecting quality of the models' predictions. This triggers the self-learning loop, leading to the re-tuning/re-training of models. Reducing the time spent in self-learning of the models is a challenging task since there exist multiple models that need to be trained repeatedly using multiple algorithms which translates into large training time. We propose a metalearner which recommends the optimal regression algorithm for a model, thereby eliminating the need for training the model on multiple algorithms for every self-learning instance. The metalearner is trained on metafeatures extracted from the data which makes it application agnostic. We introduce domain metafeatures, which enhance metalearner prediction accuracy and propose machine learning and deep learning based approaches for selecting optimal metafeatures. To ensure relevance of selected metafeatures, we introduce novel static and dynamic reward functions for dynamic metafeature selection using a Q-Learning based approach. Our metalearning approach accelerates the time for determining the optimal regressor among 5 potential regressors from 5X to 27X over the traditional self-learning approaches. The incremental pre-processing approach achieves a speed-up of 25X over the traditional approach. The proposed metalearner achieves an AUC of 0.989, 0.954 and 0.998 for ML, DL and RL based approaches for metafeature selection respectively. We illustrate our findings on 3 datasets from the industrial process domain.
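The core shortcut described above, recommending a regression algorithm from dataset statistics instead of rerunning full HPO over every candidate, can be sketched as a nearest-neighbour lookup over past tuning results. The metafeatures and history below are invented for illustration; the paper additionally selects metafeatures with ML-, DL-, and Q-learning-based approaches:

```python
import statistics

def metafeatures(X):
    """Simple dataset statistics; the paper's domain metafeatures go here too."""
    flat = [v for row in X for v in row]
    return (len(X), len(X[0]), statistics.mean(flat), statistics.pstdev(flat))

def recommend(meta, history):
    """1-NN over metafeature space: return the regressor that worked best
    on the most similar previously seen dataset."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(history, key=lambda h: dist(h[0], meta))[1]

history = [  # (metafeatures of a past dataset, best regressor found by full HPO)
    ((100, 5, 0.0, 1.0), "ridge"),
    ((100, 5, 50.0, 30.0), "random_forest"),
]
X = [[49.0, 51.0], [48.0, 52.0]]  # new sensor data, mean near 50
print(recommend(metafeatures(X), history))  # random_forest
```

Because only metafeature extraction and a lookup run per self-learning instance, the repeated multi-algorithm training loop is skipped, which is where the reported 5x-27x savings come from.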
DOI: 10.1145/3533028.3533307 | Published: 2022-06-12
Citations: 1
Accelerating container-based deep learning hyperparameter optimization workloads
Rui Liu, David Wong, David J. Lange, Patrik Larsson, Vinay Jethava, Qing Zheng
DocuSign is advancing rapidly in artificial intelligence and is steadily shifting towards developing and deploying an increasing number of deep learning models. During the development stage, developers usually build a number of deep learning models and train them under many candidate hyperparameter configurations to find the best-performing one, a process called hyperparameter optimization (HPO). Such HPO jobs can run for a long time due to ever-larger models and numerous hyperparameter configurations. Furthermore, the HPO jobs at DocuSign are processed in container-based environments so that the best-performing model can be deployed and maintained in production reliably and efficiently. The workload consists of long-running, containerized HPO jobs that can rapidly saturate the current machine learning infrastructure at DocuSign, yet the key resources (e.g., GPU memory or compute units) are not always fully utilized; for example, some hyperparameter configurations may need only a fraction of the GPU memory but occupy the entire device due to containerization. As a result, users may have to wait or manually coordinate with others for resources to run their jobs, and such HPO workloads often take an unexpectedly long time to complete. To address this problem, we propose Relish, a system designed specifically to accelerate HPO workloads by segmenting HPO jobs and efficiently sharing GPU resources in container-based environments so that multiple containerized segmented jobs can be executed in parallel. We conduct an HPO workload based on a three-month trace from a multi-tenant GPU cluster of a research and development team at DocuSign to evaluate Relish; the results demonstrate that Relish can significantly improve GPU utilization and accelerate the workload through efficient execution of multiple jobs.
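The underutilization problem has a simple core: several trials whose memory footprints sum to less than device capacity can share one GPU instead of each occupying a whole device. A first-fit-decreasing packing sketch (illustrative only; Relish's actual scheduler also segments jobs and manages containers):

```python
def pack_jobs(jobs, gpu_mem_gb, n_gpus):
    """First-fit decreasing: co-locate HPO trials whose memory footprints
    fit together on one device."""
    gpus = [{"free": gpu_mem_gb, "jobs": []} for _ in range(n_gpus)]
    pending = []
    for name, mem in sorted(jobs, key=lambda j: -j[1]):  # biggest first
        for gpu in gpus:
            if gpu["free"] >= mem:
                gpu["jobs"].append(name)
                gpu["free"] -= mem
                break
        else:
            pending.append(name)  # must wait for a slot to free up

    return gpus, pending

# hypothetical trials with estimated GPU-memory needs in GB
jobs = [("lr=1e-3", 6), ("lr=1e-4", 6), ("bs=32", 3), ("bs=64", 3), ("wd=0.1", 3)]
gpus, pending = pack_jobs(jobs, gpu_mem_gb=16, n_gpus=1)
print(gpus[0]["jobs"], pending)  # three trials share one 16 GB GPU; two wait
```

With one-container-per-GPU scheduling, these five trials would need five devices; packing runs three of them concurrently on a single device.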
DOI: 10.1145/3533028.3533309 | Published: 2022-06-12
Citations: 2
Evaluating model serving strategies over streaming data
Sonia-Florina Horchidan, Emmanouil Kritharakis, Vasiliki Kalavri, Paris Carbone
We present the first performance evaluation study of model serving integration tools in stream processing frameworks. Using Apache Flink as a representative stream processing system, we evaluate alternative Deep Learning serving pipelines for image classification. Our performance evaluation considers both the case of embedded use of Machine Learning libraries within stream tasks and that of external serving via Remote Procedure Calls. The results indicate superior throughput and scalability for pipelines that make use of embedded libraries to serve pre-trained models. Latency, however, can vary across strategies; external serving can even achieve lower latency when network conditions are optimal, owing to more specialized use of the underlying hardware. We discuss our findings and provide further motivating arguments towards research in the area of ML-native data streaming engines in the future.
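The two strategies compared can be caricatured in a few lines: an embedded operator loads the model once and scores events in-process, while external serving pays per-event serialization plus a round trip to a model server. The "RPC" below is an in-process stub and the model a trivial rule; neither Flink nor a real serving system is involved:

```python
import json

def load_model():
    """Stand-in for loading a pre-trained classifier into the stream task."""
    return lambda x: "cat" if x[0] > 0.5 else "dog"

def embedded_serve(stream):
    model = load_model()  # loaded once inside the operator, then reused
    return [model(event) for event in stream]

def external_serve(stream, rpc):
    # every event pays (de)serialization and a request to the model server
    return [json.loads(rpc(json.dumps(event)))["label"] for event in stream]

# stub standing in for a remote model server endpoint
rpc = lambda payload: json.dumps({"label": load_model()(json.loads(payload))})

events = [[0.9], [0.2], [0.7]]
print(embedded_serve(events))  # ['cat', 'dog', 'cat']
assert embedded_serve(events) == external_serve(events, rpc)
```

Both paths produce identical predictions; the study's point is that the per-event overhead on the external path dominates throughput, while dedicated serving hardware can still win on latency.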
DOI: 10.1145/3533028.3533308 | Published: 2022-06-12
Citations: 3
How I stopped worrying about training data bugs and started complaining 我是如何不再担心训练数据错误而开始抱怨的
Lampros Flokas, Weiyuan Wu, Jiannan Wang, Nakul Verma, Eugene Wu
There is an increasing awareness of the gap between machine learning research and production. The research community has largely focused on developing a model that performs well on a validation set, but the production environment needs to make sure the model also performs well in a downstream application. The latter is more challenging because the test/inference-time data used in the application could be quite different from the training data. To address this challenge, we advocate for "complaint-driven" data debugging, which allows the user to complain about the unexpected behaviors of the model in the downstream application, and proposes interventions for training data errors that likely led to the complaints. This new debugging paradigm helps solve a range of training data quality problems such as labeling error, fairness, and data drift. We present our long-term vision, highlight achieved milestones, and outline a research roadmap including a number of open problems.
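A minimal version of a complaint-driven intervention: the user flags one wrong downstream prediction, and the system searches for a single training example whose removal resolves the complaint. The 1-NN "model", the leave-one-out search, and the data are toys, not the authors' system:

```python
def predict(train_set, x):
    """Toy 1-nearest-neighbor 'model' over (value, label) pairs."""
    return min(train_set, key=lambda p: abs(p[0] - x))[1]

def explain_complaint(train_set, x, expected):
    """The user complains that predict(x) != expected; propose a single
    training-point removal that resolves the complaint."""
    if predict(train_set, x) == expected:
        return None  # nothing to complain about
    for i, point in enumerate(train_set):
        reduced = train_set[:i] + train_set[i + 1:]
        if predict(reduced, x) == expected:
            return point  # candidate labeling error in the training data
    return None

# (2.1, "a") looks mislabeled: everything near it on the right is "b"
train_set = [(0.0, "a"), (2.1, "a"), (3.0, "b")]
print(explain_complaint(train_set, 2.0, "b"))  # (2.1, 'a')
```

Real pipelines need far cheaper interventions than retrain-per-candidate, which is exactly the kind of machinery this debugging paradigm calls for.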
DOI: 10.1145/3533028.3533305 | Published: 2022-06-12
Citations: 0
GouDa - generation of universal data sets: improving analysis and evaluation of data preparation pipelines
Valerie Restat, Gerrit Boerner, Andrew P. Conrad, U. Störl
Data preparation is necessary to ensure data quality in machine learning-based decisions and data-driven systems. A variety of different tools exist to simplify this process. However, there is often a lack of suitable data sets to evaluate and compare existing tools and new research approaches. For this reason, we implemented GouDa, a tool for generating universal data sets. GouDa can be used to create data sets with arbitrary error types at arbitrary error rates. In addition to the data sets with automatically generated errors, ground truth is provided. Thus, GouDa can be used for the extensive analysis and evaluation of data preparation pipelines.
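The ground-truth pairing mentioned above is the key design point: every injected error is recorded alongside the original value, so a cleaning tool's output can be checked exactly. A sketch supporting a single error type (missing values) at a configurable rate; GouDa itself covers arbitrary error types:

```python
import random

def inject_errors(rows, error_rate, seed=0):
    """Inject missing values at the given rate.
    Returns (dirty data, ground truth) where ground truth lists
    (row, column, original value) for every injected error."""
    rng = random.Random(seed)  # seeded for reproducible data sets
    dirty, truth = [], []
    for i, row in enumerate(rows):
        new_row = list(row)
        for j, value in enumerate(row):
            if rng.random() < error_rate:
                new_row[j] = None
                truth.append((i, j, value))
        dirty.append(new_row)
    return dirty, truth

clean = [[1, "x"], [2, "y"], [3, "z"]]
dirty, truth = inject_errors(clean, error_rate=0.5)
# every recorded error position is actually missing in the dirty data
assert all(dirty[i][j] is None and clean[i][j] == v for i, j, v in truth)
```

Because the ground truth is exact, evaluation of a data preparation pipeline reduces to comparing its repairs against the recorded `(row, column, value)` triples.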
DOI: 10.1145/3533028.3533311 | Published: 2022-06-12
Citations: 1
Minun
Jin Wang, Yuliang Li
Entity Matching (EM) is an important problem in data integration and cleaning. More recently, deep learning techniques, especially pre-trained language models, have been integrated into EM applications and achieved promising results. Unfortunately, the significant performance gain comes with the loss of explainability and transparency, deterring EM from the requirement of responsible data management. To address this issue, recent studies extended explainable AI techniques to explain black-box EM models. However, these solutions have the major drawbacks that (i) their explanations do not capture the unique semantics characteristics of the EM problem; and (ii) they fail to provide an objective method to quantitatively evaluate the provided explanations. In this paper, we propose Minun, a model-agnostic method to generate explanations for EM solutions. We utilize counterfactual examples generated from an EM customized search space as the explanations and develop two search algorithms to efficiently find such results. We also come up with a novel evaluation framework based on a student-teacher paradigm. The framework enables the evaluation of explanations of diverse formats by capturing the performance gain of a "student" model at simulating the target "teacher" model when explanations are given as side input. We conduct an extensive set of experiments on explaining state-of-the-art deep EM models on popular EM benchmark datasets. The results demonstrate that Minun significantly outperforms popular explainable AI methods such as LIME and SHAP on both explanation quality and scalability.
{"title":"Minun","authors":"Jin Wang, Yuliang Li","doi":"10.1145/3533028.3533304","DOIUrl":"https://doi.org/10.1145/3533028.3533304","url":null,"abstract":"Entity Matching (EM) is an important problem in data integration and cleaning. More recently, deep learning techniques, especially pre-trained language models, have been integrated into EM applications and achieved promising results. Unfortunately, the significant performance gain comes with the loss of explainability and transparency, deterring EM from the requirement of responsible data management. To address this issue, recent studies extended explainable AI techniques to explain black-box EM models. However, these solutions have the major drawbacks that (i) their explanations do not capture the unique semantics characteristics of the EM problem; and (ii) they fail to provide an objective method to quantitatively evaluate the provided explanations. In this paper, we propose Minun, a model-agnostic method to generate explanations for EM solutions. We utilize counterfactual examples generated from an EM customized search space as the explanations and develop two search algorithms to efficiently find such results. We also come up with a novel evaluation framework based on a student-teacher paradigm. The framework enables the evaluation of explanations of diverse formats by capturing the performance gain of a \"student\" model at simulating the target \"teacher\" model when explanations are given as side input. We conduct an extensive set of experiments on explaining state-of-the-art deep EM models on popular EM benchmark datasets. 
The results demonstrate that Minun significantly outperforms popular explainable AI methods such as LIME and SHAP on both explanation quality and scalability.","PeriodicalId":345888,"journal":{"name":"Proceedings of the Sixth Workshop on Data Management for End-To-End Machine Learning","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129512678","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
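The counterfactual idea can be illustrated with a toy matcher: search for a minimal edit to one record that flips the match decision; the edited tokens then serve as the explanation. The Jaccard matcher and greedy single-token search below are stand-ins for the paper's EM-customized search space and algorithms:

```python
def match(a, b, threshold=0.5):
    """Toy EM 'model': records match iff token-set Jaccard similarity >= threshold."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) >= threshold

def counterfactual(a, b):
    """Greedy search for one token to drop from `a` so the match decision
    flips; that token 'explains' the decision."""
    original = match(a, b)
    tokens = a.split()
    for i, token in enumerate(tokens):
        candidate = " ".join(tokens[:i] + tokens[i + 1:])
        if match(candidate, b) != original:
            return [token]
    return None  # no single-token counterfactual exists

a, b = "apple macbook pro 13", "apple macbook air 13"
print(match(a, b), counterfactual(a, b))  # True ['apple']
```

Dropping "apple" pushes the similarity below the threshold, so that token is pivotal to the match; a real EM explainer searches a much richer edit space over both records.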
Citations: 2