Learning Efficiently Over Heterogeneous Databases: Sampling and Constraints to the Rescue

Jose Picado, Arash Termehchy, Sudhanshu Pathak
DOI: 10.1145/3209889.3209899 (https://doi.org/10.1145/3209889.3209899)
Published in: Proceedings of the Second Workshop on Data Management for End-to-End Machine Learning (2nd : 2018 : Houston, Tex.)
Publication date: 2018-06-15

Abstract

Given a relational database and training examples for a target relation, relational learning algorithms learn a definition of the target relation in terms of the existing relations in the database. We propose CastorX, a relational learning system that learns efficiently across multiple heterogeneous databases. The user specifies connections and relationships between databases using a set of declarative constraints called matching dependencies (MDs). Each MD connects tuples in different databases that are related and can meaningfully join, even though the values of their join attributes may not be equal because the databases represent those values differently. CastorX leverages these constraints during learning to find the information relevant to the training data and the target definition across multiple databases. Because an MD may connect each tuple in one database to a large number of tuples in other databases, the learning process can become very slow. CastorX therefore uses sampling techniques to learn efficiently while still producing accurate definitions.
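
To make the two ideas in the abstract concrete, the following is a minimal Python sketch of (1) a matching dependency treated as a similarity join between attributes of two databases and (2) sampling to cap how many matched tuples are followed per tuple. It is not CastorX's implementation: the abstract does not say which similarity measure or sampling strategy the system uses, so the string-similarity function, the 0.8 threshold, uniform sampling, and all relation and attribute names below are assumptions chosen for illustration.

```python
import random
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Approximate string match (assumed similarity measure and threshold)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def md_join(left_tuple, right_relation, left_attr, right_attr, threshold=0.8):
    """Tuples of right_relation linked to left_tuple by the MD
    left_attr ~ right_attr, i.e. values that are similar but not
    necessarily equal."""
    return [
        r for r in right_relation
        if similar(left_tuple[left_attr], r[right_attr], threshold)
    ]


def sampled_md_join(left_tuple, right_relation, left_attr, right_attr,
                    k, rng, threshold=0.8):
    """Same as md_join, but keep at most k matches chosen uniformly at
    random (assumed strategy) so one tuple cannot fan out without bound."""
    matches = md_join(left_tuple, right_relation, left_attr, right_attr,
                      threshold)
    if len(matches) <= k:
        return matches
    return rng.sample(matches, k)


# Hypothetical relations from two databases that spell titles differently.
imdb_movies = [{"id": 1, "title": "Star Wars: Episode IV"}]
bom_films = [
    {"fid": "a", "name": "Star Wars Episode IV"},
    {"fid": "b", "name": "star wars: episode iv"},
    {"fid": "c", "name": "Star Wars: Episode 4"},
    {"fid": "d", "name": "The Godfather"},
]

if __name__ == "__main__":
    movie = imdb_movies[0]
    full = md_join(movie, bom_films, "title", "name")
    capped = sampled_md_join(movie, bom_films, "title", "name",
                             k=2, rng=random.Random(0))
    print(len(full), "matches under the MD;", len(capped), "kept after sampling")
```

An exact equality join on title and name would link none of these tuples, while the similarity-based MD links three of the four. The cap k trades completeness for speed: a learner that follows only the sampled matches does less work per training example, at the risk of missing some relevant tuples.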