Collaborative analytics for data silos

Jinkyu Kim, Heonseok Ha, Byung-Gon Chun, Sungroh Yoon, S. Cha
Published in: 2016 IEEE 32nd International Conference on Data Engineering (ICDE), pp. 743-754
Publication date: 2016-05-16
DOI: 10.1109/ICDE.2016.7498286 (https://doi.org/10.1109/ICDE.2016.7498286)
Citations: 12

Abstract

As a great deal of data has been accumulated in various disciplines, the need for the integrative analysis of separate but relevant data sources is becoming more important. Combining data sources can provide global insight that is otherwise difficult to obtain from individual sources. Because of privacy, regulations, and other issues, many large-scale data repositories remain closed off from the outside, raising what has been termed the data silo issue. The huge volume of today's big data often leads to computational challenges, adding another layer of complexity to the solution. In this paper, we propose a novel method called collaborative analytics by ensemble learning (CABEL), which attempts to resolve the main hurdles regarding the silo issue: accuracy, privacy, and computational efficiency. CABEL represents the data stored in each silo as a compact aggregate of samples called the silo signature. The compact representation provides computational efficiency and privacy preservation but makes it challenging to produce accurate analytics. To resolve this challenge, we formulate the problem of attribute domain sampling and reconstruction, and propose a solution called the Chebyshev subset. To model collaborative efforts to analyze semantically linked but structurally disconnected databases, CABEL utilizes a new ensemble learning technique termed the weighted bagging of base classifiers. We demonstrate the effectiveness of CABEL by testing with a nationwide health-insurance data set containing approximately 4,182,000,000 records collected from the entire population of an Organisation for Economic Co-operation and Development (OECD) country in 2012. In our binary classification tests, CABEL achieved median recall, precision, and F-measure values of 89%, 64%, and 76%, respectively, although only 0.001-0.00001% of the original data was used for model construction, while maintaining data privacy and computational efficiency.
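
To put the figures in the abstract into perspective, the following minimal sketch works through the arithmetic. It is not code from the paper: the function names `f_measure` and `records_used` are hypothetical, and the F-measure is assumed to be the standard harmonic mean of precision and recall (F1), which the abstract does not state explicitly.

```python
# Illustrative arithmetic only; names below are assumptions, not from the paper.

TOTAL_RECORDS = 4_182_000_000  # approximate record count given in the abstract


def records_used(percent: float, total: int = TOTAL_RECORDS) -> float:
    """Number of records that a given percentage of the data set corresponds to."""
    return total * percent / 100


def f_measure(precision: float, recall: float) -> float:
    """Standard F1 score: the harmonic mean of precision and recall (assumed definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Bounds of the sampling range quoted in the abstract (0.001% down to 0.00001%).
upper = records_used(0.001)
lower = records_used(0.00001)

# F1 computed from the reported median precision (64%) and median recall (89%).
f1_of_medians = f_measure(precision=0.64, recall=0.89)

print(f"records used: between {lower:.1f} and {upper:.1f}")
print(f"F1 of the median precision and recall: {f1_of_medians:.3f}")
```

Under these assumptions the sampling range corresponds to roughly 418 to 41,820 records, and the F1 of the two median values is about 0.745. That is slightly below the reported median F-measure of 76%, which is not a contradiction: the median of per-test F-measures need not equal the F-measure computed from the median precision and median recall.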