
2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW 2010): Latest Papers

Midas for government: Integration of government spending data on Hadoop
Pub Date : 2010-03-01 DOI: 10.1109/ICDEW.2010.5452758
Antonio Sala, Calvin Lin, Howard Ho
We describe our experience in developing a Hadoop-based integration flow to collect and integrate publicly available datasets on government spending. The objective is to give users, U.S. taxpayers in this case, easy access to the data their government discloses across different websites. We also provide easy-to-use tools for querying and exploring the integrated data, so that users can evaluate how tax money is spent.
Citations: 11
Flash-enabled database storage
Pub Date : 2010-03-01 DOI: 10.1109/ICDEW.2010.5452712
Ioannis Koltsidas, Stratis Viglas
Flash memory has emerged as a high-performing and viable storage alternative to magnetic disks for data-intensive applications. In our work we study how the storage layer of a database system can benefit from the presence of a flash disk. Due to the varying price and I/O characteristics of flash disks the optimal design decisions vary widely across different setups. We study how the system can take advantage of the random read efficiency of inexpensive flash disks by using the latter at the same level of memory hierarchy as magnetic disks in a hybrid setup; our algorithms provide efficient and adaptive data placement that leads to substantial performance improvement. We propose techniques to accurately predict the main memory cache behavior for systems consisting of heterogeneous storage media and selectively allocate memory buffers to devices; thereby, the I/O cost of the system drops significantly, even offsetting wrong data placement decisions. We also explore the design space for a system that uses flash memory as a cache to the underlying storage and propose techniques towards high performance. The experimental results, we believe, exhibit both the potential and necessity of our techniques in future database systems.
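One way to picture the hybrid placement idea is a per-page cost comparison. The sketch below is not the authors' algorithm: it assumes hypothetical per-operation costs (`DeviceCost`, `FLASH`, `DISK`, all invented for illustration) and places a page on whichever device would have served its observed reads and writes more cheaply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceCost:
    read_ms: float   # assumed cost of one random page read
    write_ms: float  # assumed cost of one random page write


# Hypothetical numbers: cheap random reads on flash, expensive random writes.
FLASH = DeviceCost(read_ms=0.1, write_ms=8.0)
DISK = DeviceCost(read_ms=5.0, write_ms=5.0)


def io_cost(reads, writes, device):
    """Total I/O time a device would spend on the given page accesses."""
    return reads * device.read_ms + writes * device.write_ms


def place_page(reads, writes):
    """Return 'flash' or 'disk', whichever serves the workload more cheaply."""
    flash = io_cost(reads, writes, FLASH)
    disk = io_cost(reads, writes, DISK)
    return "flash" if flash < disk else "disk"
```

Under these assumed costs a read-mostly page goes to flash and a write-mostly page stays on disk; an adaptive scheme would re-evaluate the decision as the access counts change.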
Citations: 1
Partitioning real-time ETL workflows
Pub Date : 2010-03-01 DOI: 10.1109/ICDEW.2010.5452754
A. Simitsis, Chetan Gupta, Song Wang, U. Dayal
Many organizations aim to move away from traditional batch-processing ETL to real-time ETL (RT-ETL). This move is motivated by the need to analyze and act on data that is as fresh as possible. RT-ETL engines operate on the abstraction of a data flow executed on parallel architectures. To achieve high throughput and low response times, the data must be partitioned over the large number of nodes in the engine. In this paper, we consider the problem of partitioning real-time ETL flows and propose a high-level architecture for it.
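The abstract does not say which partitioning scheme the architecture uses. As a minimal sketch, assuming simple hash partitioning on a single key field (the function name `partition` and the record layout are hypothetical), data could be spread over nodes like this:

```python
import zlib
from collections import defaultdict


def partition(records, key, num_nodes):
    """Assign each record to a node by hashing its key field."""
    nodes = defaultdict(list)
    for record in records:
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        node = zlib.crc32(str(record[key]).encode("utf-8")) % num_nodes
        nodes[node].append(record)
    return dict(nodes)
```

Hashing keeps all records with the same key on the same node, which matters for keyed operators such as joins and aggregations; range or round-robin partitioning would be alternative choices with different trade-offs.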
Citations: 15
Complement union for data integration
Pub Date : 2010-03-01 DOI: 10.1109/ICDEW.2010.5452760
Jens Bleiholder, Sascha Szott, Melanie Herschel, Felix Naumann
A data integration process consists of mapping source data into a target representation (schema mapping [1]), identifying multiple representations of the same real-world object (duplicate detection [2]), and finally combining these representations into a single consistent representation (data fusion [3]). Because multiple representations of an object are generally not exactly equal, data fusion must take special care in handling data conflicts. This paper focuses on the definition and implementation of complement union, an operator that defines a new semantics for data fusion.
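The abstract does not define the complement union operator, so the sketch below does not attempt to implement it. It only illustrates the data-fusion step in general terms, under the assumption that a conflict means two representations hold different non-null values for the same attribute; `find_conflicts` and `fuse` are hypothetical names.

```python
def find_conflicts(a, b):
    """Attributes on which both representations hold different non-null values."""
    return sorted(
        k for k in a.keys() & b.keys()
        if a[k] is not None and b[k] is not None and a[k] != b[k]
    )


def fuse(a, b):
    """Merge two representations if they do not conflict; otherwise return None."""
    if find_conflicts(a, b):
        return None
    merged = {}
    for k in a.keys() | b.keys():
        va, vb = a.get(k), b.get(k)
        # Prefer whichever side actually has a value for this attribute.
        merged[k] = va if va is not None else vb
    return merged
```

Two records that merely fill in each other's missing values fuse cleanly, while records that disagree on a value are left for an explicit conflict-resolution strategy.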
Citations: 9
Fast algorithms for time series mining
Pub Date : 2010-03-01 DOI: 10.1109/ICDEW.2010.5452719
Lei Li, C. Faloutsos
In this paper, we present fast algorithms for mining coevolving time series, with or without missing values. Our algorithms mine meaningful patterns effectively and efficiently. With those patterns, they can perform forecasting, compression, and segmentation. Furthermore, we apply our algorithms to practical problems, including occlusions in motion capture and the generation of natural human motions by stitching low-effort motions. We also propose a parallel learning algorithm for linear dynamical systems (LDS) that fully utilizes the power of multicore and multiprocessor machines and that will serve as a cornerstone of many applications and algorithms for time series.
Citations: 3
Automatic tuning of the multiprogramming level in Sybase SQL Anywhere
Pub Date : 2010-03-01 DOI: 10.1109/ICDEW.2010.5452740
Mohammed Abouzour, K. Salem, P. Bumbulis
This paper looks at the problem of automatically tuning the database server multiprogramming level to improve database server performance under varying workloads. We describe two tuning algorithms that were considered and how they performed under different workloads. We then present the hybrid approach that we have successfully implemented in SQL Anywhere 12. We found that the hybrid approach yielded better performance than each of the algorithms separately.
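The abstract does not name the two tuning algorithms. Purely as an illustration of what automatic tuning of the multiprogramming level (MPL) can look like, the sketch below assumes a simple hill-climbing controller; `hill_climb_mpl` and the `measure_throughput` callback are hypothetical and are not part of SQL Anywhere.

```python
def hill_climb_mpl(measure_throughput, start=4, step=1, lo=1, hi=64, rounds=50):
    """Adjust the MPL one step at a time, keeping changes that raise throughput."""
    mpl = start
    best = measure_throughput(mpl)
    direction = 1  # +1 tries a higher MPL, -1 tries a lower one
    for _ in range(rounds):
        candidate = max(lo, min(hi, mpl + direction * step))
        if candidate == mpl:
            # Hit a bound: probe the other direction next round.
            direction = -direction
            continue
        throughput = measure_throughput(candidate)
        if throughput > best:
            mpl, best = candidate, throughput
        else:
            direction = -direction
    return mpl
```

On a throughput curve with a single peak this controller settles at the peak and then keeps probing both neighbours, which is also how it would notice a shift in the workload. A real server would measure noisy throughput over time intervals, so some smoothing would be needed.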
Citations: 17
End-to-end confidentiality for a message warehousing service using Identity-Based Encryption
Pub Date : 2010-03-01 DOI: 10.1109/ICDEW.2010.5452750
Yücel Karabulut, Harald Weppner, I. Nassi, A. Nagarajan, Yash Shroff, Nishant Dubey, Tyelisa Shields
More and more classes of devices are becoming capable of connecting to the Internet. Because point-to-point communication is insufficient for many non-interactive application integration scenarios, we assume the existence of a logically centralized message warehousing service that clients can use to deposit and retrieve messages. The particular challenge in this context is that a client depositing messages can only describe eligible receiving clients by their characterizing attributes and does not know their specific identities. The depositing client still wants to prevent exposure of the message content to the message warehousing service. We explore how this many-to-many integration between devices and enterprise systems can achieve end-to-end information confidentiality using a solution based on Identity-Based Encryption.
Citations: 4
U-DBSCAN: A density-based clustering algorithm for uncertain objects
Pub Date : 2010-03-01 DOI: 10.1109/ICDEW.2010.5452734
Apinya Tepwankul, S. Maneewongvatana
In recent years, uncertain data has attracted increasing research interest because it occurs naturally in many applications, such as location-based services and sensor services. In this paper, we study the problem of clustering uncertain objects. We propose a new deviation function that approximates the underlying uncertainty model of objects and a new density-based clustering algorithm, U-DBSCAN, that utilizes the proposed deviation. Since there is currently no cluster quality measure for density-based clustering, we also propose a metric that specifically measures the density quality of a clustering solution. Finally, we perform a set of experiments to evaluate the quality of our algorithm using this metric. The results reveal that U-DBSCAN gives better clustering quality, with comparable running time, than the traditional approach of running DBSCAN on representative points of the objects.
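For orientation, the baseline mentioned at the end of the abstract (plain DBSCAN on one representative point per object) can be sketched as follows. This is not U-DBSCAN, and the deviation function is not reproduced here; the choice of the sample centroid as the representative point is an assumption made for illustration.

```python
from math import dist


def representative(samples):
    """Centroid of an uncertain object's sample points (assumed representation)."""
    n = len(samples)
    return tuple(sum(coords) / n for coords in zip(*samples))


def dbscan(points, eps, min_pts):
    """Plain DBSCAN; returns one label per point, with -1 meaning noise."""
    labels = [None] * len(points)

    def neighbours(i):
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:  # j is a core point, so keep expanding
                queue.extend(nbrs)
    return labels
```

Each uncertain object is first reduced with `representative`, and the resulting points are passed to `dbscan`. Collapsing an object to a single point discards its spread, which is the information a method for uncertain objects aims to exploit.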
Citations: 20
Semantic flooding: Search over semantic links
Pub Date : 2010-03-01 DOI: 10.1109/ICDEW.2010.5452749
Fausto Giunchiglia, Uladzimir Kharkevich, Alethia Hume, Piyatat Chatvorawit
Classification hierarchies are trees where links codify the fact that a node lower in the hierarchy contains documents whose contents are more specific than those one level above. In turn, multiple classification hierarchies can be connected by semantic links which represent mappings among them and which can be computed, e.g., by ontology matching. In this paper we describe how these two types of links can be used to define a semantic overlay network which can cover any number of peers and which can be flooded to perform semantic search on documents, i.e., to perform semantic flooding. We have evaluated our approach in a simulation of the network of 10,000 peers containing classifications which are fragments of the DMoz Web directory. The results are very promising and show that, in our approach, only a relatively small number of peers needs to be queried in order to achieve high accuracy.
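The flooding part of the idea can be pictured as a hop-limited breadth-first traversal of the overlay. The sketch below is a generic flooding routine, not the paper's protocol: it assumes the semantic links have already been computed and are given as an adjacency map, and the names `flood`, `links`, and `ttl` are hypothetical.

```python
from collections import deque


def flood(links, start, ttl):
    """Return the set of peers reached from `start` within `ttl` hops."""
    reached = {start}
    frontier = deque([(start, 0)])
    while frontier:
        peer, hops = frontier.popleft()
        if hops == ttl:
            continue  # hop budget used up: do not forward any further
        for neighbour in links.get(peer, ()):
            if neighbour not in reached:
                reached.add(neighbour)
                frontier.append((neighbour, hops + 1))
    return reached
```

Because a query is only forwarded along links to peers whose classifications are semantically related, the reached set can stay small relative to the whole network, which is consistent with the abstract's observation that few peers need to be queried.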
Citations: 4
Towards better entity resolution techniques for Web document collections
Pub Date : 2010-03-01 DOI: 10.1109/ICDEW.2010.5452698
Surender Reddy Yerva, Z. Miklós, K. Aberer
As person names are non-unique, the same name on different Web pages might or might not refer to the same real-world person. This entity identification problem is one of the most challenging issues in realizing the Semantic Web or entity-oriented search. We address this disambiguation problem, which is very similar to the entity resolution problem studied in relational databases, although there are several differences. Most importantly, Web pages often contain only partial or incomplete information about the persons, and the available information is very heterogeneous. As a result, similarity functions yield only uncertain evidence about whether two names refer to the same person. These similarity functions capture only some aspects of the similarity between the Web pages where the names occur, so their performance varies considerably across names. We analyze some data engineering techniques to cope with the limited accuracy of the similarity functions and to combine multiple functions. Even with our simple techniques, we could demonstrate systematic performance improvements and produce results comparable to state-of-the-art methods.
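The abstract does not state how the similarity functions are combined. As a minimal sketch, assuming a plain weighted average of scores in [0, 1], two token-based similarity functions could be combined like this (the function names and the weights are hypothetical):

```python
def tokens(text):
    """Lower-cased whitespace tokens of a text snippet."""
    return set(text.lower().split())


def jaccard(a, b):
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def overlap(a, b):
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / min(len(ta), len(tb)) if ta and tb else 0.0


def combined_similarity(a, b, weighted_functions):
    """Weighted average of several similarity scores, each in [0, 1]."""
    total = sum(weight for _, weight in weighted_functions)
    return sum(weight * f(a, b) for f, weight in weighted_functions) / total
```

A pair of pages would then be judged to refer to the same person when the combined score exceeds a threshold. Since the abstract notes that the functions behave very differently across names, fixed global weights are likely too crude in practice, and per-name weighting would be a natural refinement.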
Citations: 10