
2011 International Conference on Data and Knowledge Engineering (ICDKE): Latest Articles

rsLDA: A Bayesian hierarchical model for relational learning
Pub Date : 2011-10-20 DOI: 10.1109/ICDKE.2011.6053932
Claudio Taranto, Nicola Di Mauro, F. Esposito
We introduce and evaluate a technique for relational learning tasks that combines a framework for mining relational queries with a hierarchical Bayesian model. We present the novel rsLDA algorithm, which works as follows. It first discovers a set of relevant features in the relational data that describe the examples in a propositional way. This corresponds to reformulating the problem from a relational representation space into an attribute-value form. Then, given this new feature space, a supervised version of the Latent Dirichlet Allocation model is applied to learn the probabilistic model. On two real-world datasets, the proposed method shows an improvement over other methods.
Citations: 6
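The first step described in the rsLDA abstract, recasting relational data in attribute-value form, can be illustrated with a small sketch. This is not the authors' feature-mining framework: the toy molecule data, the query helpers `has_atom` and `has_bond`, and the function `propositionalize` are all hypothetical names chosen for illustration, and the supervised LDA step is not shown.

```python
# Hypothetical relational data: each example is a molecule with typed atoms and bonds.
examples = {
    "m1": {"atoms": {"a1": "c", "a2": "o"}, "bonds": {("a1", "a2")}},
    "m2": {"atoms": {"a1": "c", "a2": "c"}, "bonds": {("a1", "a2")}},
    "m3": {"atoms": {"a1": "n"}, "bonds": set()},
}


def has_atom(element):
    """Relational query: does the example contain an atom of this element?"""
    return lambda ex: element in ex["atoms"].values()


def has_bond(e1, e2):
    """Relational query: is there a bond joining elements e1 and e2?"""
    def query(ex):
        return any(
            {ex["atoms"][x], ex["atoms"][y]} == {e1, e2} for x, y in ex["bonds"]
        )
    return query


# Each query becomes one boolean attribute (one column of the table).
features = [
    ("has_atom(c)", has_atom("c")),
    ("has_atom(o)", has_atom("o")),
    ("has_bond(c,o)", has_bond("c", "o")),
]


def propositionalize(examples, features):
    """Evaluate every query on every example, giving one 0/1 row per example."""
    return {
        name: [int(query(ex)) for _, query in features]
        for name, ex in examples.items()
    }


table = propositionalize(examples, features)
```

Each row of `table` is a fixed-length vector, so any propositional learner could in principle consume it in place of the original relational structure.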
SVM based approaches for classifying protein tertiary structures
Pub Date : 2011-10-20 DOI: 10.1109/ICDKE.2011.6053917
G. Mirceva, D. Davcev
The tertiary structure of a protein molecule is the main factor that can be used to determine its chemical properties as well as its function. Knowledge of protein function is crucial in the development of new drugs, better crops and synthetic biochemicals. With the rapid development of technology, the number of determined protein structures increases every day, so retrieving structurally similar proteins using current algorithms takes too long. Therefore, improving the efficiency of methods for protein structure retrieval and classification is an important research issue in the bioinformatics community. In this paper, we present two SVM-based protein classifiers. Our classifiers use information about the conformation of protein structures in 3D space. Namely, our voxel-based and ray-based protein descriptors are used to represent the protein structures. A part of the SCOP 1.73 database is used to evaluate our classifiers. The results show that our approach achieves 98.7% classification accuracy using the ray-based descriptor, while being much faster than other similar algorithms with comparable accuracy. We provide some experimental results.
Citations: 0
A multicriteria recommendation method for data with missing rating scores
Pub Date : 2011-10-20 DOI: 10.1109/ICDKE.2011.6053931
A. Takasu
This paper proposes a recommendation method for multi-criteria (MC) collaborative filtering, where users are required to rate each item on multiple aspects and systems use this rich information to improve recommendation accuracy. One drawback of MC recommender systems is the cost to users of scoring items, because each item must be rated on every criterion. To overcome this drawback, we aim to develop an MC recommender system that tolerates missing rating information. This paper proposes generative models for MC recommendation that are robust against missing scores. In these models we map a list of rating scores on MC to a low-dimensional feature space. Correlation among the scores on MC is embedded in the feature space, so we can expect a score list to be mapped to a nearby point in the feature space even if some scores are missing. We conducted experiments on Yahoo! movie data to check the robustness of the proposed models, and show experimentally that they are less affected by missing information than a Pearson-correlation-based collaborative filtering method.
Citations: 3
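The multicriteria abstract above compares against a Pearson-correlation baseline. The sketch below is one common way to compute that similarity when scores are missing, restricting the calculation to entries both users have rated. It is not the paper's generative model, and the function name `pearson` and the use of `None` for a missing score are assumptions made for illustration.

```python
import math


def pearson(u, v):
    """Pearson correlation over the positions where both score lists have a value.

    Missing scores are represented as None (an assumption of this sketch).
    Returns 0.0 when fewer than two scores overlap or when either side is constant.
    """
    common = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    n = len(common)
    if n < 2:
        return 0.0
    mean_a = sum(a for a, _ in common) / n
    mean_b = sum(b for _, b in common) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in common)
    var_a = sum((a - mean_a) ** 2 for a, _ in common)
    var_b = sum((b - mean_b) ** 2 for _, b in common)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)
```

As scores go missing, the overlap shrinks and the estimate rests on fewer and fewer pairs, until it collapses to the fallback value. That sensitivity is the kind of weakness a model robust to missing scores is meant to reduce.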
Building a distributed authenticating CDN
Pub Date : 2011-10-20 DOI: 10.1109/ICDKE.2011.6053930
Sam Moffatt
In recent times, much has been made of the security, or lack thereof, utilised within Facebook's content distribution network (CDN). Their CDN is noted to enable public access to any resource via a GET request presuming the user knows the URL for the resource. This means that not only can users directly access material that they would otherwise not have access to but it also means that material that has been considered “deleted” may still be accessible. noncdn is a content distribution network designed to provide light-weight authenticated access to content stored at edge nodes with easily replicated authentication access through time limited authentication tokens. noncdn provides “volumes” as a container for handling access control and authentication nodes for generation and validation of authentication tokens. As tokens identify individuals, accesses can be logged and tracked to provide extra auditing functionality.
Citations: 0
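The CDN abstract above relies on time-limited authentication tokens that an authentication node generates and an edge node validates. A minimal sketch of that general idea, using an HMAC over the user, volume and expiry time, might look like the following. The token layout, the shared `SECRET`, and the function names `issue_token` and `validate_token` are assumptions for illustration; the abstract does not describe noncdn's actual token format.

```python
import hashlib
import hmac
import time

# Hypothetical key shared by authentication nodes and edge nodes.
SECRET = b"shared-secret-between-auth-and-edge-nodes"


def _sign(payload, secret):
    """Hex HMAC-SHA256 signature of the payload string."""
    return hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()


def issue_token(user, volume, ttl_seconds, now=None, secret=SECRET):
    """Create a token naming the user and volume, valid for ttl_seconds."""
    now = time.time() if now is None else now
    expires = int(now) + ttl_seconds
    payload = f"{user}|{volume}|{expires}"
    return f"{payload}|{_sign(payload, secret)}"


def validate_token(token, volume, now=None, secret=SECRET):
    """Return the user name if the token is authentic, unexpired and for this volume."""
    now = time.time() if now is None else now
    try:
        user, tok_volume, expires_text, sig = token.split("|")
        expires = int(expires_text)
    except ValueError:
        return None  # malformed token
    expected = _sign(f"{user}|{tok_volume}|{expires_text}", secret)
    # Constant-time comparison avoids leaking signature prefixes via timing.
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return None
    if tok_volume != volume or now >= expires:
        return None
    return user
```

Because the token carries the user name, an edge node that validates it can also log who accessed which volume, which matches the auditing point made in the abstract.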
Multi-level continuous skyline queries (MCSQ)
Pub Date : 2011-10-20 DOI: 10.1109/ICDKE.2011.6053927
Eman El-Dawy, Hoda M. O. Mokhtar, A. El-Bastawissy
Most current work on skyline queries deals with static query points over static data sets. With advances in wireless communication, mobile computing, and positioning technologies, it has become possible to obtain and manage (model, index, query, etc.) the trajectories of moving objects in real life, and consequently the need for continuous skyline query processing has become more and more pressing. In this paper, we address the problem of efficiently maintaining continuous skyline queries that contain both static and dynamic attributes. We present a Multi-level Continuous Skyline Query (MCSQ) algorithm, which creates a pre-computed skyline data set, facilitates skyline updates, and improves query running time and performance. In brief, our algorithm proceeds as follows. First, we identify the data points that are permanently in the skyline and use them to derive a search bound. Second, we establish a pre-computed data set for the dynamic skyline that depends on the number of skyline levels (M), which is later used to update the first (initial) skyline points. Third, every time the skyline needs to be updated, we use the pre-computed skyline data sets to update the previous skyline set and, consequently, the first skyline. Finally, we present experimental results to demonstrate the performance and efficiency of our algorithm.
Citations: 5
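The MCSQ abstract above builds on two notions: skyline dominance and multiple skyline levels. The sketch below shows both on static points by repeatedly peeling off the current skyline. It is a naive quadratic illustration, not the MCSQ algorithm itself (it has no search bound and no continuous update), and it assumes that smaller values are better in every dimension. The function names are hypothetical.

```python
def dominates(p, q):
    """p dominates q if p is no worse in every dimension and better in at least one.

    Assumes smaller values are better (an assumption of this sketch).
    """
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline(points):
    """Points that no other point dominates (naive O(n^2) scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def skyline_levels(points, m):
    """First m skyline levels: level 1 is the skyline, level 2 is the skyline
    of what remains once level 1 is removed, and so on."""
    remaining = list(points)
    levels = []
    for _ in range(m):
        if not remaining:
            break
        level = skyline(remaining)
        levels.append(level)
        remaining = [p for p in remaining if p not in level]
    return levels


pts = [(1, 9), (3, 3), (9, 1), (4, 4), (5, 8), (8, 8)]
levels = skyline_levels(pts, 2)
```

Keeping the lower levels around is what makes updates cheap in principle: when a first-level point leaves the skyline, its replacements can only come from the next level down rather than from the whole data set.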
Meta-model based knowledge discovery
Pub Date : 2011-10-20 DOI: 10.1109/ICDKE.2011.6053918
Dominic Girardi, J. Dirnberger, M. Giretzlehner
Data acquisition and data mining are often seen as two independent processes in research. We introduce a meta-information based, highly generic data acquisition system which is able to store data of almost arbitrary structure. Based on the meta-information we plan to apply data mining algorithms for knowledge retrieval. Furthermore, the results from the data mining algorithms will be used to apply plausibility checks for the subsequent data acquisition, in order to maintain the quality of the collected data. So, the gap between data acquisition and data mining shall be decreased.
Citations: 3
A purpose based usage access control model for E-healthcare services
Pub Date : 2011-10-20 DOI: 10.1109/ICDKE.2011.6053928
Lili Sun, Hua Wang
Information privacy is a major concern for customers who provide private data that can promote future business services, especially in E-healthcare services. E-healthcare is the use of web-based systems to share and deliver information across the Internet, where private data provided by customers is easily disclosed. In a large health organization, this private data has to be protected through proper authorization and access control models for e-Health systems. Usage access control is considered the next-generation access control model, distinguished by the continuity of its decisions. It has been proven to improve security administration efficiently through flexible authorization management. Usage control enables finer-grained control over the usage of digital objects, which offers better access control for private information in E-healthcare systems. In this paper, we design a comprehensive usage access control approach with a purpose extension to tackle private data protection in E-healthcare services.
Citations: 8
A generalization of blocking and windowing algorithms for duplicate detection
Pub Date : 2011-10-20 DOI: 10.1109/ICDKE.2011.6053920
Uwe Draisbach, Felix Naumann
Duplicate detection is the process of finding multiple records in a dataset that represent the same real-world entity. Due to the enormous costs of an exhaustive comparison, typical algorithms select only promising record pairs for comparison. Two competing approaches are blocking and windowing. Blocking methods partition records into disjoint subsets, while windowing methods, in particular the Sorted Neighborhood Method, slide a window over the sorted records and compare records only within the window. We present a new algorithm called Sorted Blocks in several variants, which generalizes both approaches. To evaluate Sorted Blocks, we have conducted extensive experiments with different datasets. These show that our new algorithm needs fewer comparisons to find the same number of duplicates.
Citations: 78
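The duplicate-detection abstract above contrasts blocking with windowing. The sketch below generates candidate record pairs under each of the two classic strategies so the difference is visible. It does not implement the Sorted Blocks algorithm the paper proposes. The function names, the toy records and the choice of keys are assumptions for illustration.

```python
from itertools import combinations


def blocking_pairs(records, key):
    """Blocking: partition records by key and compare only within each block."""
    blocks = {}
    for i, rec in enumerate(records):
        blocks.setdefault(key(rec), []).append(i)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(ids, 2))  # ids are ascending, so pairs are (low, high)
    return pairs


def sorted_neighborhood_pairs(records, key, window):
    """Sorted Neighborhood: sort by key, slide a window of the given size,
    and compare each record with the others inside the same window."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            pairs.add((min(i, j), max(i, j)))
    return pairs


names = ["smith", "smyth", "smith j", "taylor", "tailor"]
by_block = blocking_pairs(names, key=lambda r: r[0])
by_window = sorted_neighborhood_pairs(names, key=lambda r: r, window=2)
```

On these five records an exhaustive comparison needs 10 pairs, while each strategy here yields 4. Blocking never compares across block boundaries, whereas the window can straddle them, which is the trade-off a generalization of the two has to balance.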
A database model for heterogeneous spatial collections: Definition and algebra
Pub Date : 2011-10-20 DOI: 10.1109/ICDKE.2011.6053926
G. Psaila
Spatial DBMSs usually extend the classical relational model with data types for georeferenced data, providing a suitable extension of SQL. Designing a database for heterogeneous technological infrastructures may be hard, and queries may be hard to write and slow to execute. We define a data model able to model, in a natural way, heterogeneous collections of spatial objects. The query algebra provides new operators that can naturally express complex queries on heterogeneous collections by automatically deriving spatial descriptions from the composition relationships.
Citations: 9
Dealing with domain knowledge in association rules mining — Several experiments
Pub Date : 2011-10-20 DOI: 10.1109/ICDKE.2011.6053919
J. Rauch, M. Simunek
Experiments concerning dealing with domain knowledge in association rules mining are presented. Formalized items of domain knowledge are used. Each such item is converted into a set of all association rules that can be considered as its consequences.
Citations: 1