Geospatial Schema Matching with High-Quality Cluster Assurance and Location Mining from Social Network

L. Khan, J. Partyka, Satyen Abrol, B. Thuraisingham
DOI: 10.1109/ICDMW.2010.204 (https://doi.org/10.1109/ICDMW.2010.204)
Published in: 2010 IEEE International Conference on Data Mining Workshops, 2010-12-13
Citations: 0

Abstract

In this talk, we present how semantics can improve the quality of the data mining process. We first focus on geospatial schema matching with high-quality cluster assurance, and then on location mining from social networks.

With regard to the first problem, resolving semantic heterogeneity across distinct data sources remains a highly relevant problem in the GIS domain, one that requires innovative solutions. Our approach, called GSim, semantically aligns tables from the respective GIS databases by first choosing attributes for comparison. We then examine their instances and calculate a similarity value between them, called Entropy-Based Distribution (EBD), by combining two separate methods. Our primary method discerns the geographic types from the instances of the compared attributes. If geographic type matching is not possible, we apply a generic schema matching method that employs normalized Google distance together with a clustering process. GSim proceeds by deriving clusters from attribute instances based on their content and, where possible, their geographic types, gleaned from a gazetteer. However, clustering algorithms may produce inconsistent results because cluster quality varies. We apply novel metrics that measure cluster distance and purity to guarantee high-quality homogeneous clusters. The end result is a wholly geospatial similarity value, expressed as EBD. We show the effectiveness of our approach over the traditional N-gram approach across multi-jurisdictional datasets.

With regard to the second problem, we predict the location of a user on the basis of his or her social network (e.g., Twitter) using the strong theoretical framework of semi-supervised learning; in particular, we employ a label propagation algorithm. For privacy and security reasons, most people on social networking sites like Twitter are unwilling to specify their locations explicitly. On the city locations returned by the algorithm, the system performs agglomerative clustering based on geospatial proximity and the locations' individual scores to return clusters of locations with higher confidence. We perform extensive experiments to show the validity of our system in terms of both accuracy and running time. Experimental results show that our approach outperforms the content-based geo-tagging approach in both accuracy and running time.
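The abstract names label propagation but gives no details of the authors' formulation, so the following is only a minimal sketch of the generic textbook technique, not the system described in the talk. It assumes an undirected friendship graph and a handful of users with known cities; the function names (`propagate_labels`, `predict`), the user names, and the cities are all hypothetical. Seed users are clamped to their known city, and every other user repeatedly takes the average of their friends' city scores. The agglomerative clustering step mentioned in the abstract is not included.

```python
from collections import defaultdict


def propagate_labels(edges, known, iterations=50):
    """Generic label propagation with clamped seeds (illustrative only).

    edges: iterable of (user_a, user_b) friendship pairs (undirected).
    known: dict mapping some users to their known city.
    Returns a dict: user -> {city: score}.
    """
    # Build an undirected adjacency list from the friendship edges.
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    cities = sorted(set(known.values()))
    # Seed users put all their mass on the known city; others start at zero.
    scores = {
        user: {c: (1.0 if known.get(user) == c else 0.0) for c in cities}
        for user in neighbors
    }

    for _ in range(iterations):
        updated = {}
        for user, friends in neighbors.items():
            if user in known:
                # Clamp: a seed user's label never changes.
                updated[user] = scores[user]
            else:
                # An unlabeled user takes the mean of their friends' scores.
                updated[user] = {
                    c: sum(scores[f][c] for f in friends) / len(friends)
                    for c in cities
                }
        scores = updated
    return scores


def predict(scores, user):
    """Return the highest-scoring city for a user and its score."""
    best = max(scores[user], key=scores[user].get)
    return best, scores[user][best]


# Hypothetical toy network: two users disclose a city, three do not.
edges = [("ann", "bob"), ("bob", "cat"), ("cat", "dan"),
         ("ann", "cat"), ("dan", "eve")]
known = {"ann": "Dallas", "eve": "Austin"}

scores = propagate_labels(edges, known)
for user in ("bob", "cat", "dan"):
    print(user, predict(scores, user))
```

In this toy graph, the users closest to "ann" lean toward Dallas, while "dan", who is directly connected to "eve", leans toward Austin. The per-city scores produced this way are the kind of input that a later proximity-based clustering step could combine.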