Title: Midas for government: Integration of government spending data on Hadoop
Authors: Antonio Sala, Calvin Lin, Howard Ho
Pub Date: 2010-03-01 | DOI: 10.1109/ICDEW.2010.5452758
Venue: 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW 2010)
Abstract: We describe our experience developing a Hadoop-based integration flow to collect and integrate publicly available government datasets related to government spending. The objective is to enable users, U.S. taxpayers in this case, to easily access the data their government discloses across different websites. We also provide easy-to-use tools to query and explore the integrated data, allowing users to evaluate how tax money is spent.
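An integration flow of this kind can be sketched in miniature as a single map/reduce pass: mappers normalize records from heterogeneous sources into a common key/value shape, and a reducer aggregates them. The source names, field names, and figures below are hypothetical, not the authors' actual Midas flow.

```python
from collections import defaultdict

# Two toy "source datasets" with different schemas, standing in for the
# heterogeneous government websites (hypothetical records and fields).
usaspending = [
    {"agency_name": "DOE", "award_usd": "1200.50"},
    {"agency_name": "DOT", "award_usd": "800.00"},
]
grants_site = [
    {"agency": "DOE", "amount": 300.0},
    {"agency": "HHS", "amount": 450.0},
]

def map_usaspending(rec):
    # Normalize source-specific fields into a common (key, value) pair.
    return rec["agency_name"], float(rec["award_usd"])

def map_grants(rec):
    return rec["agency"], rec["amount"]

def reduce_by_key(pairs):
    # Shuffle + reduce: sum spending per agency, as a Hadoop reducer would.
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

mapped = [map_usaspending(r) for r in usaspending] + \
         [map_grants(r) for r in grants_site]
spending_by_agency = reduce_by_key(mapped)
```

On a real cluster the two map functions would run as Hadoop mapper tasks over each source and the reducer would receive the shuffled pairs grouped by agency.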
Title: Flash-enabled database storage
Authors: Ioannis Koltsidas, Stratis Viglas
Pub Date: 2010-03-01 | DOI: 10.1109/ICDEW.2010.5452712
Venue: 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW 2010)
Abstract: Flash memory has emerged as a high-performing and viable storage alternative to magnetic disks for data-intensive applications. In our work we study how the storage layer of a database system can benefit from the presence of a flash disk. Because the price and I/O characteristics of flash disks vary, the optimal design decisions differ widely across setups. We study how the system can exploit the random-read efficiency of inexpensive flash disks by using them at the same level of the memory hierarchy as magnetic disks in a hybrid setup; our algorithms provide efficient and adaptive data placement that leads to substantial performance improvement. We propose techniques to accurately predict main-memory cache behavior for systems with heterogeneous storage media and to selectively allocate memory buffers to devices; the I/O cost of the system thereby drops significantly, even offsetting wrong data-placement decisions. We also explore the design space for a system that uses flash memory as a cache for the underlying storage and propose techniques for high performance. The experimental results, we believe, demonstrate both the potential and the necessity of our techniques in future database systems.
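The core placement idea, putting each page on the device that would have served its observed workload more cheaply, can be illustrated with a toy cost-based placer. The per-I/O costs below are illustrative numbers only (flash with cheap random reads but expensive random writes, as was typical of inexpensive flash circa 2010), and this is a simplification of the kind of policy the paper studies, not its actual algorithm.

```python
class HybridPlacer:
    """Toy adaptive placement over a flash/magnetic hybrid: read-heavy
    pages go to flash, write-heavy pages stay on magnetic disk."""

    # Assumed per-I/O costs in ms (illustrative, not measured values).
    COSTS = {"flash": {"read": 0.1, "write": 20.0},
             "disk":  {"read": 8.0, "write": 8.0}}

    def __init__(self):
        self.stats = {}   # page -> {"read": n, "write": n}

    def access(self, page, kind):
        # Record one logical read or write against the page.
        self.stats.setdefault(page, {"read": 0, "write": 0})[kind] += 1

    def place(self, page):
        # Choose the device that would have served the page's observed
        # access mix at the lowest total cost.
        s = self.stats.get(page, {"read": 0, "write": 0})
        cost = {dev: s["read"] * c["read"] + s["write"] * c["write"]
                for dev, c in self.COSTS.items()}
        return min(cost, key=cost.get)

placer = HybridPlacer()
for _ in range(100):
    placer.access("hot_index_page", "read")   # random-read heavy
for _ in range(50):
    placer.access("log_page", "write")        # write heavy
```

With these costs the read-heavy page lands on flash and the write-heavy page on magnetic disk; the paper's adaptive algorithms refine this idea with online statistics and migration.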
Title: Partitioning real-time ETL workflows
Authors: A. Simitsis, Chetan Gupta, Song Wang, U. Dayal
Pub Date: 2010-03-01 | DOI: 10.1109/ICDEW.2010.5452754
Venue: 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW 2010)
Abstract: Many organizations aim to move away from traditional batch-processing ETL to real-time ETL (RT-ETL), motivated by the need to analyze and make decisions on data that is as fresh as possible. RT-ETL engines operate on the abstraction of a data flow executed on parallel architectures. For high throughput and low response times, the data must be partitioned over the large number of nodes in the engine. In this paper, we consider the problem of partitioning real-time ETL flows and propose a high-level architecture for it.
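The basic building block of such partitioning is routing each incoming row to a node by hashing a partitioning key, so rows sharing a key land on the same node and per-key operators need no cross-node coordination. The sketch below is a generic hash partitioner with made-up rows, not the architecture proposed in the paper.

```python
import hashlib

def partition(row, key, num_nodes):
    # Hash the row's partitioning key to a stable node index. md5 is used
    # only for a deterministic, well-spread hash, not for security.
    digest = hashlib.md5(str(row[key]).encode()).hexdigest()
    return int(digest, 16) % num_nodes

# Hypothetical ETL rows keyed by customer id.
rows = [{"customer": "c1", "amount": 10},
        {"customer": "c2", "amount": 20},
        {"customer": "c1", "amount": 5}]

# Route each row to one of 4 engine nodes.
nodes = {}
for row in rows:
    nodes.setdefault(partition(row, "customer", 4), []).append(row)
```

All rows for customer "c1" end up on the same node, so a per-customer aggregate can be computed locally; a real RT-ETL engine layers flow-aware placement and load balancing on top of this.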
Title: Complement union for data integration
Authors: Jens Bleiholder, Sascha Szott, Melanie Herschel, Felix Naumann
Pub Date: 2010-03-01 | DOI: 10.1109/ICDEW.2010.5452760
Venue: 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW 2010)
Abstract: A data integration process consists of mapping source data into a target representation (schema mapping [1]), identifying multiple representations of the same real-world object (duplicate detection [2]), and finally combining these representations into a single consistent representation (data fusion [3]). As multiple representations of an object are generally not exactly equal, data fusion must take special care in handling data conflicts. This paper focuses on the definition and implementation of complement union, an operator that defines a new semantics for data fusion.
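The core intuition behind fusing complementary tuples can be shown in a few lines: two tuples combine when they never disagree on an attribute both of them have, and each one's nulls are filled from the other. This toy captures only that null-complementing idea; the operator's full semantics in the paper is richer.

```python
def complementary(t1, t2):
    # Two tuples complement each other if they never disagree on an
    # attribute for which both have a non-null (non-None) value.
    return all(t1[k] == t2[k]
               for k in t1.keys() & t2.keys()
               if t1[k] is not None and t2[k] is not None)

def fuse(t1, t2):
    # Combine two complementary tuples, preferring non-null values.
    merged = dict(t2)
    merged.update({k: v for k, v in t1.items() if v is not None})
    return merged

# Two representations of the same person, each incomplete (made-up data).
a = {"name": "Alice", "phone": None, "city": "Berlin"}
b = {"name": "Alice", "phone": "555-1234", "city": None}
fused = fuse(a, b) if complementary(a, b) else None
```

The fused tuple carries every attribute either representation knew; tuples that actually conflict (same attribute, different non-null values) would not be merged.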
Title: Fast algorithms for time series mining
Authors: Lei Li, C. Faloutsos
Pub Date: 2010-03-01 | DOI: 10.1109/ICDEW.2010.5452719
Venue: 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW 2010)
Abstract: In this paper, we present fast algorithms for mining coevolving time series, with or without missing values. Our algorithms mine meaningful patterns effectively and efficiently, and with those patterns they support forecasting, compression, and segmentation. Furthermore, we apply our algorithms to practical problems, including handling occlusions in motion capture and generating natural human motions by stitching low-effort motions. We also propose a parallel learning algorithm for linear dynamical systems (LDS) that fully utilizes multicore/multiprocessor machines and can serve as a cornerstone of many applications and algorithms for time series.
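To give a flavor of linear-model forecasting, the sketch below fits the simplest possible linear recurrence, x[t+1] = a * x[t], by least squares and rolls it forward. This is a deliberate simplification: the authors' LDS models have hidden states and are learned with EM, which this toy does not attempt.

```python
def fit_ar1(series):
    # Least-squares fit of the coefficient a in x[t+1] = a * x[t]:
    # a = sum(x[t] * x[t+1]) / sum(x[t]^2).
    num = sum(x * y for x, y in zip(series, series[1:]))
    den = sum(x * x for x in series[:-1])
    return num / den

def forecast(series, a, steps):
    # Iterate the fitted recurrence from the last observed value.
    out, x = [], series[-1]
    for _ in range(steps):
        x = a * x
        out.append(x)
    return out

# Synthetic geometric-decay series where a = 0.9 exactly.
series = [1.0, 0.9, 0.81, 0.729]
a = fit_ar1(series)
future = forecast(series, a, 2)
```

On this noiseless series the fit recovers a = 0.9 and the forecast continues the decay; real coevolving series need the richer hidden-state models the paper develops.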
Title: Automatic tuning of the multiprogramming level in Sybase SQL Anywhere
Authors: Mohammed Abouzour, K. Salem, P. Bumbulis
Pub Date: 2010-03-01 | DOI: 10.1109/ICDEW.2010.5452740
Venue: 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW 2010)
Abstract: This paper addresses the problem of automatically tuning the database server's multiprogramming level to improve server performance under varying workloads. We describe two tuning algorithms that we considered and how they performed under different workloads. We then present the hybrid approach that we have successfully implemented in SQL Anywhere 12. We found that the hybrid approach yielded better performance than either algorithm on its own.
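One family of tuners in this space is feedback-driven hill climbing: probe throughput at neighboring multiprogramming levels and move toward the better one. The sketch below shows that generic idea against a made-up throughput curve; it is not the hybrid algorithm implemented in SQL Anywhere 12.

```python
def hill_climb_mpl(throughput_at, start=8, step=2, iters=20):
    # Greedy hill climbing on the multiprogramming level (MPL): at each
    # iteration, keep whichever of {mpl-step, mpl, mpl+step} gives the
    # highest measured throughput; stop when the current MPL wins.
    mpl = start
    for _ in range(iters):
        up, down = mpl + step, max(1, mpl - step)
        best = max((up, down, mpl), key=throughput_at)
        if best == mpl:
            break
        mpl = best
    return mpl

# Hypothetical throughput curve peaking at MPL = 16, with thrashing
# (falling throughput) beyond the peak.
def curve(mpl):
    return -(mpl - 16) ** 2 + 400

tuned = hill_climb_mpl(curve)
```

A pure hill climber like this can get stuck or oscillate on noisy real measurements, which is one motivation for combining tuning strategies as the paper does.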
Title: End-to-end confidentiality for a message warehousing service using Identity-Based Encryption
Authors: Yücel Karabulut, Harald Weppner, I. Nassi, A. Nagarajan, Yash Shroff, Nishant Dubey, Tyelisa Shields
Pub Date: 2010-03-01 | DOI: 10.1109/ICDEW.2010.5452750
Venue: 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW 2010)
Abstract: More and more classes of devices are becoming capable of connecting to the Internet. Because point-to-point communication is insufficient for many non-interactive application-integration scenarios, we assume the existence of a logically centralized message warehousing service that clients can use to deposit and retrieve messages. The particular challenge in this context is that a client depositing messages can describe eligible receiving clients only by their characterizing attributes and does not know their specific identities. The depositing client still wants to prevent exposure of the message content to the message warehousing service. We explore how this many-to-many integration between devices and enterprise systems can achieve end-to-end information confidentiality using a solution based on Identity-Based Encryption.
Title: U-DBSCAN: A density-based clustering algorithm for uncertain objects
Authors: Apinya Tepwankul, S. Maneewongvatana
Pub Date: 2010-03-01 | DOI: 10.1109/ICDEW.2010.5452734
Venue: 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW 2010)
Abstract: In recent years, uncertain data has gained increasing research interest due to its natural presence in many applications, such as location-based services and sensor services. In this paper, we study the problem of clustering uncertain objects. We propose a new deviation function that approximates the underlying uncertainty model of objects, and a new density-based clustering algorithm, U-DBSCAN, that utilizes the proposed deviation. Since there is currently no cluster-quality measure for density-based clustering, we also propose a metric that specifically measures the density quality of a clustering solution. Finally, we perform a set of experiments to evaluate the effectiveness of our algorithm using this metric. The results reveal that U-DBSCAN gives better clustering quality with comparable running time, relative to the traditional approach of running DBSCAN on representative points of the objects.
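A minimal way to make DBSCAN uncertainty-aware is to represent each object as a set of equally likely sample positions and run the usual density expansion over expected pairwise distances. This sketch uses that simple expectation in place of the paper's dedicated deviation function, and tiny made-up data; it only illustrates the idea.

```python
import math
from itertools import product

def expected_dist(p, q):
    # Expected Euclidean distance between two uncertain 2-D objects,
    # each given as a list of equally likely sample positions.
    pairs = list(product(p, q))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

def dbscan_uncertain(objects, eps, min_pts):
    # Standard DBSCAN expansion, with expected_dist as the metric.
    labels = [None] * len(objects)          # None = unvisited, -1 = noise

    def neighbors(i):
        return [j for j in range(len(objects))
                if expected_dist(objects[i], objects[j]) <= eps]

    cluster = 0
    for i in range(len(objects)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1                  # tentatively noise
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] in (None, -1):
                if labels[j] is None:
                    more = neighbors(j)
                    if len(more) >= min_pts:   # j is a core object
                        queue.extend(k for k in more if labels[k] is None)
                labels[j] = cluster
        cluster += 1
    return labels

# Two tight groups of uncertain objects (each object = 2 samples).
objs = [[(0, 0), (0.1, 0)], [(0.2, 0), (0.3, 0)], [(0.1, 0.1), (0, 0.2)],
        [(5, 5), (5.1, 5)], [(5.2, 5), (5.3, 5)], [(5.1, 5.1), (5, 5.2)]]
labels = dbscan_uncertain(objs, eps=0.5, min_pts=2)
```

The two groups come out as two clusters. The paper's contribution lies precisely where this sketch is naive: approximating the uncertainty model well without paying the quadratic all-pairs sample cost.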
Title: Semantic flooding: Search over semantic links
Authors: Fausto Giunchiglia, Uladzimir Kharkevich, Alethia Hume, Piyatat Chatvorawit
Pub Date: 2010-03-01 | DOI: 10.1109/ICDEW.2010.5452749
Venue: 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW 2010)
Abstract: Classification hierarchies are trees whose links codify the fact that a node lower in the hierarchy contains documents whose contents are more specific than those one level above. In turn, multiple classification hierarchies can be connected by semantic links, which represent mappings among them and which can be computed, e.g., by ontology matching. In this paper we describe how these two types of links can be used to define a semantic overlay network that can cover any number of peers and that can be flooded to perform semantic search on documents, i.e., to perform semantic flooding. We have evaluated our approach in a simulated network of 10,000 peers containing classifications that are fragments of the DMoz Web directory. The results are very promising and show that, in our approach, only a relatively small number of peers needs to be queried in order to achieve high accuracy.
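The flooding idea can be sketched as a bounded breadth-first traversal that follows only semantic links, so semantically unrelated peers are never contacted. The peers, topics, and links below are invented for illustration; the paper's actual system computes the links by ontology matching and ranks results, which this toy omits.

```python
from collections import deque

# Toy overlay: each peer classifies its documents under a topic, and
# semantic links map related topics across peers (hypothetical data).
peers = {
    "p1": {"topic": "databases", "docs": ["a.pdf"]},
    "p2": {"topic": "data engineering", "docs": ["b.pdf"]},
    "p3": {"topic": "cooking", "docs": ["c.pdf"]},
}
# Directed semantic links: databases <-> data engineering; p3 unrelated.
semantic_links = {("p1", "p2"), ("p2", "p1")}

def semantic_flood(start, max_hops=2):
    # BFS from the querying peer, expanding only along semantic links
    # and collecting documents from every reached peer.
    seen, results = {start}, list(peers[start]["docs"])
    frontier = deque([(start, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if hops == max_hops:
            continue
        for a, b in semantic_links:
            if a == node and b not in seen:
                seen.add(b)
                results.extend(peers[b]["docs"])
                frontier.append((b, hops + 1))
    return sorted(results), seen

results, visited = semantic_flood("p1")
```

Only the two semantically linked peers are visited, which mirrors the paper's finding that a relatively small number of peers suffices for high accuracy.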
Title: Towards better entity resolution techniques for Web document collections
Authors: Surender Reddy Yerva, Z. Miklós, K. Aberer
Pub Date: 2010-03-01 | DOI: 10.1109/ICDEW.2010.5452698
Venue: 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW 2010)
Abstract: As person names are not unique, the same name on different Web pages may or may not refer to the same real-world person. This entity identification problem is one of the most challenging issues in realizing the Semantic Web or entity-oriented search. We address this disambiguation problem, which is very similar to the entity resolution problem studied in relational databases, although there are several differences. Most importantly, Web pages often contain only partial or incomplete information about the persons; moreover, the available information is very heterogeneous, so we can obtain only uncertain evidence about whether two names refer to the same person, using similarity functions. These similarity functions capture some aspects of the similarity between the Web pages where the names occur, and thus perform very differently for different names. We analyze data engineering techniques to cope with the limited accuracy of the similarity functions and to combine multiple functions. Even with these simple techniques we demonstrate systematic performance improvements and produce results comparable to state-of-the-art methods.
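The simplest way to combine several similarity functions is a weighted average of their scores. The sketch below pairs a word-overlap (Jaccard) function with a same-domain signal on made-up page records; the paper analyzes more careful combination techniques that account for each function's name-dependent accuracy, which this baseline does not.

```python
def combined_similarity(page1, page2, functions, weights):
    # Weighted average of per-aspect similarity scores in [0, 1].
    total = sum(weights)
    return sum(w * f(page1, page2)
               for f, w in zip(functions, weights)) / total

def jaccard_words(p1, p2):
    # Word-set overlap between the two pages' text.
    a = set(p1["text"].lower().split())
    b = set(p2["text"].lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def same_domain(p1, p2):
    # Binary signal: pages hosted on the same domain.
    return 1.0 if p1["domain"] == p2["domain"] else 0.0

# Hypothetical Web pages mentioning the same name.
page_a = {"text": "John Smith professor of biology", "domain": "uni.edu"}
page_b = {"text": "John Smith biology department", "domain": "uni.edu"}

score = combined_similarity(page_a, page_b,
                            [jaccard_words, same_domain], [2, 1])
```

A threshold on the combined score would then decide whether the two pages refer to the same person; choosing per-function weights (and thresholds) well is exactly where such baselines fall short of the paper's techniques.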