Identifying learners robust to low quality data
Pub Date: 2008-07-13 | DOI: 10.1109/IRI.2008.4583028
A. Folleco, T. Khoshgoftaar, J. V. Hulse, Amri Napolitano
Real-world datasets commonly contain noise distributed in both the independent and dependent variables. Noise, which typically consists of erroneous variable values, has been shown to significantly degrade the classification performance of learners. In this study, we identify learners with robust performance in the presence of low-quality (noisy) measurement data. Noise was injected into five class-imbalanced software engineering measurement datasets that were initially relatively free of noise. The experimental factors considered included the learner used, the level of injected noise, the dataset used (each with unique properties), and the percentage of minority instances containing noise. We found no other studies that identify learners robust to low-quality measurement data. Based on the results of this study, we recommend the random forest learner for building classification models from noisy data.
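The experimental design lends itself to a compact illustration. Below is a minimal sketch, assuming a synthetic imbalanced dataset in place of the software engineering data: labels of a controlled fraction of minority-class instances are corrupted, and two learners are compared as the noise level rises. This is our reconstruction for illustration, not the authors' exact protocol.

```python
# A minimal sketch of the noise-injection experiment, not the authors' exact
# protocol: corrupt the labels of a fraction of minority instances and compare
# how a random forest and a single decision tree degrade.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Synthetic stand-in for a class-imbalanced measurement dataset (10% minority).
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

def inject_minority_label_noise(y, level, rng):
    """Flip the labels of a `level` fraction of minority instances to the majority class."""
    y_noisy = y.copy()
    minority = np.flatnonzero(y == 1)
    flipped = rng.choice(minority, size=int(level * len(minority)), replace=False)
    y_noisy[flipped] = 0
    return y_noisy

for level in (0.0, 0.1, 0.3):
    y_noisy = inject_minority_label_noise(y, level, rng)
    for model in (RandomForestClassifier(random_state=0),
                  DecisionTreeClassifier(random_state=0)):
        auc = cross_val_score(model, X, y_noisy, scoring="roc_auc", cv=5).mean()
        print(f"noise={level:.0%} {type(model).__name__}: AUC={auc:.3f}")
```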
{"title":"Identifying learners robust to low quality data","authors":"A. Folleco, T. Khoshgoftaar, J. V. Hulse, Amri Napolitano","doi":"10.1109/IRI.2008.4583028","DOIUrl":"https://doi.org/10.1109/IRI.2008.4583028","url":null,"abstract":"Real world datasets commonly contain noise that is distributed in both the independent and dependent variables. Noise, which typically consists of erroneous variable values, has been shown to significantly affect the classification performance of learners. In this study, we identify learners with robust performance in the presence of low quality (noisy) measurement data. Noise was injected into five class imbalanced software engineering measurement datasets, initially relatively free of noise. The experimental factors considered included the learner used, the level of injected noise, the dataset used (each with unique properties), and the percentage of minority instances containing noise. No other related studies were found that have identified learners that are robust in the presence of low quality measurement data. Based on the results of this study, we recommend using the random forest learner for building classification models from noisy data.","PeriodicalId":169554,"journal":{"name":"2008 IEEE International Conference on Information Reuse and Integration","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125207083","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Using a search engine to query a relational database
Pub Date: 2008-07-13 | DOI: 10.1109/IRI.2008.4582997
Brian Harrington, R. Brazile, K. Swigger
While search engines are the most popular way to find information on the web, they are generally not used to query relational databases (RDBs). This paper describes a technique for making the data in an RDB accessible to standard search engines. The technique involves using a URL to express a query and creating a wrapper that processes the URL-query and generates web pages containing the answer to the query as well as links to additional data. By following these links, a crawler is able to index the RDB along with all the URL-queries. Once the content and the corresponding URL-queries have been indexed, a user may submit keyword queries through a standard search engine and receive up-to-date database information. We tested whether the system could return results similar to those of equivalent SQL queries, and whether a standard search engine such as Google could actually index the database content appropriately.
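A minimal sketch of the wrapper idea, under our own naming (the paper does not publish code): each URL encodes a query, and the wrapper answers with an HTML page whose links are themselves URL-queries, so an ordinary crawler can walk the database by following links.

```python
# Illustrative wrapper: translate a URL-query into SQL, render the rows, and
# emit further URL-queries as links for the crawler to follow.
import sqlite3
from urllib.parse import parse_qs, urlparse

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)")
db.executemany("INSERT INTO employee VALUES (?, ?, ?)",
               [(1, "Ada", "RD"), (2, "Grace", "RD"), (3, "Alan", "QA")])

def handle_url_query(url):
    """Answer a URL-query like /employee?dept=QA with a crawlable HTML page."""
    parts = urlparse(url)
    table = parts.path.strip("/")        # a real wrapper would whitelist names
    filters = {k: v[0] for k, v in parse_qs(parts.query).items()}
    where = " AND ".join(f"{k} = ?" for k in filters) or "1=1"
    rows = db.execute(f"SELECT id, name, dept FROM {table} WHERE {where}",
                      list(filters.values())).fetchall()
    items = "".join(
        f'<li>{name} <a href="/{table}?dept={dept}">others in {dept}</a></li>'
        for _, name, dept in rows)
    return f"<html><body><ul>{items}</ul></body></html>"

print(handle_url_query("/employee?dept=QA"))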
{"title":"Using a search engine to query a relational database","authors":"Brian Harrington, R. Brazile, K. Swigger","doi":"10.1109/IRI.2008.4582997","DOIUrl":"https://doi.org/10.1109/IRI.2008.4582997","url":null,"abstract":"While search engines are the most popular way to find information on the web, they are generally not used to query relational databases (RDBs). This paper describes a technique for making the data in an RDB accessible to standard search engines. The technique involves using a URL to express queries and creating a wrapper that can then process the URL-query and generate web pages that contain the answer to the query as well as links to additional data. By following these links, a crawler is able to index the RDB along with all the URL-queries. Once the content and their corresponding URL-queries have been indexed, a user may submit keyword queries through a standard search engine and receive up-to-date database information. The system was then tested to determine if it could return results that were similar to those submitted using SQL. We also looked at whether a standard search engine such as Google could actually index the database content appropriately.","PeriodicalId":169554,"journal":{"name":"2008 IEEE International Conference on Information Reuse and Integration","volume":"96 6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134057919","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data warehouse architecture and design
Pub Date: 2008-07-13 | DOI: 10.1109/IRI.2008.4583005
Mohammad Rifaie, K. Kianmehr, R. Alhajj, M. Ridley
A data warehouse is attractive as the main repository of an organization's historical data and is optimized for reporting and analysis. In this paper, we present the process of data warehouse architecture development and design. We highlight the different aspects to be considered in building a data warehouse, ranging from data store characteristics to data modeling and the principles to be followed for an effective data warehouse architecture.
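As one concrete illustration of the data modeling the paper discusses, here is a standard star schema (a textbook example, not a design taken from the paper): a central fact table keyed to descriptive dimension tables, optimized for the aggregate queries that reporting requires.

```python
# Textbook star schema: a fact table of measures joined to dimension tables.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE fact_sales  (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    units       INTEGER,
    revenue     REAL
);
""")

# A typical reporting query: aggregate the fact table, slice by dimensions.
query = """
SELECT d.year, p.category, SUM(f.revenue)
FROM fact_sales f
JOIN dim_date d    ON f.date_key = d.date_key
JOIN dim_product p ON f.product_key = p.product_key
GROUP BY d.year, p.category
"""
print(db.execute(query).fetchall())   # empty until the warehouse is loaded
```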
{"title":"Data warehouse architecture and design","authors":"Mohammad Rifaie, K. Kianmehr, R. Alhajj, M. Ridley","doi":"10.1109/IRI.2008.4583005","DOIUrl":"https://doi.org/10.1109/IRI.2008.4583005","url":null,"abstract":"A data warehouse is attractive as the main repository of an organization’s historical data and is optimized for reporting and analysis. In this paper, we present a data warehouse the process of data warehouse architecture development and design. We highlight the different aspects to be considered in building a data warehouse. These range from data store characteristics to data modeling and the principles to be considered for effective data warehouse architecture.","PeriodicalId":169554,"journal":{"name":"2008 IEEE International Conference on Information Reuse and Integration","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131273512","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Authoritative documents identification based on Nonnegative Matrix Factorization
Pub Date: 2008-07-13 | DOI: 10.1109/IRI.2008.4583040
N. F. Chikhi, B. Rothenburger, Nathalie Aussenac-Gilles
Current techniques for authoritative document identification (ADI) suffer from two main drawbacks. On the one hand, the results of several ADI algorithms cannot be interpreted in a straightforward manner; this symptom is observed, for instance, in the HITS family of algorithms. On the other hand, the accuracy of some ADI algorithms is poor; for instance, PHITS overcomes the interpretability issue of HITS at the price of low accuracy. In this paper, we propose a new ADI algorithm, NHITS, which experimentally outperforms both HITS and PHITS in terms of interpretability and accuracy.
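For context, here is a compact implementation of the HITS baseline that the paper improves on; the NMF-based NHITS algorithm itself is specified in the paper and not reproduced here.

```python
# HITS by power iteration: authority and hub scores reinforce each other.
import numpy as np

def hits(adjacency, iterations=50):
    """Return (authority, hub) scores for a directed graph given as a 0/1
    matrix A where A[i, j] = 1 means page i links to page j."""
    n = adjacency.shape[0]
    authority = np.ones(n)
    hub = np.ones(n)
    for _ in range(iterations):
        authority = adjacency.T @ hub   # good authorities are cited by good hubs
        hub = adjacency @ authority     # good hubs cite good authorities
        authority /= np.linalg.norm(authority)
        hub /= np.linalg.norm(hub)
    return authority, hub

A = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)
print(hits(A))
```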
{"title":"Authoritative documents identification based on Nonnegative Matrix Factorization","authors":"N. F. Chikhi, B. Rothenburger, Nathalie Aussenac-Gilles","doi":"10.1109/IRI.2008.4583040","DOIUrl":"https://doi.org/10.1109/IRI.2008.4583040","url":null,"abstract":"Current techniques for authoritative documents identification (ADI) suffer two main drawbacks. On the one hand, results of several ADI algorithms cannot be interpreted in a straightforward manner. This symptom is observed for instance in the HITS family algorithms. On the other hand, accuracy of some ADI algorithms is poor. For instance, PHITS overcomes the interpretability issue of HITS at the price of a low accuracy. In this paper, we propose a new ADI algorithm, namely NHITS, which experimentally outperforms both HITS and PHITS in terms of interpretability and accuracy.","PeriodicalId":169554,"journal":{"name":"2008 IEEE International Conference on Information Reuse and Integration","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133333699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An unsupervised protein sequences clustering algorithm using functional domain information
Pub Date: 2008-07-13 | DOI: 10.1109/IRI.2008.4583008
Wei-bang Chen, Chengcui Zhang, Hua Zhong
In this paper, we present a novel unsupervised approach for clustering protein sequences that incorporates functional domain information into the clustering process. In the proposed framework, the domain boundaries predicted by the ProDom database provide a better measure for calculating sequence similarity. The clustering kernel is a two-phase algorithm: hierarchical clustering in the first phase pre-clusters the protein sequences, and partitioning clustering in the second phase refines the results. More specifically, we first perform agglomerative hierarchical clustering on the protein sequences to obtain initial clusters, and then build a profile Hidden Markov Model (HMM) for each cluster to represent its centroid. In the second phase, HMM-based k-means clustering refines the clusters into protein families. Experimental results show that our model is effective and efficient in clustering protein families.
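A simplified sketch of the two-phase scheme, with deliberate substitutions: k-mer count vectors stand in for the paper's ProDom-informed similarity, and plain mean centroids stand in for the profile HMMs, which would require domain databases and HMM tooling in practice.

```python
# Phase 1: agglomerative pre-clustering; Phase 2: k-means refinement seeded
# with the phase-1 centroids. Features are simple k-mer counts.
from itertools import product
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans

def kmer_vector(seq, k=2, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Represent a protein sequence by its k-mer counts."""
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    v = np.zeros(len(index))
    for i in range(len(seq) - k + 1):
        if seq[i:i + k] in index:
            v[index[seq[i:i + k]]] += 1
    return v

seqs = ["MKTAYIAKQR", "MKTAYLAKQR", "GGHVVEGLAG", "GGHVVEGLTG"]  # toy input
X = np.array([kmer_vector(s) for s in seqs])

# Phase 1: hierarchical clustering yields the initial partition (labels 1..2).
labels0 = fcluster(linkage(X, method="average"), t=2, criterion="maxclust")
# Phase 2: k-means refines the partition, seeded with the phase-1 centroids.
seeds = np.array([X[labels0 == c].mean(axis=0) for c in (1, 2)])
print(KMeans(n_clusters=2, init=seeds, n_init=1).fit_predict(X))
```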
{"title":"An unsupervised protein sequences clustering algorithm using functional domain information","authors":"Wei-bang Chen, Chengcui Zhang, Hua Zhong","doi":"10.1109/IRI.2008.4583008","DOIUrl":"https://doi.org/10.1109/IRI.2008.4583008","url":null,"abstract":"In this paper, we present an unsupervised novel approach for protein sequences clustering by incorporating the functional domain information into the clustering process. In the proposed framework, the domain boundaries predicated by ProDom database are used to provide a better measurement in calculating the sequence similarity. In addition, we use an unsupervised clustering algorithm as the kernel that includes a hierarchical clustering in the first phase to pre-cluster the protein sequences, and a partitioning clustering in the second phase to refine the clustering results. More specifically, we perform the agglomerative hierarchical clustering on protein sequences in the first phase to obtain the initial clustering results for the subsequent partitioning clustering, and then, a profile Hidden Markove Model (HMM) is built for each cluster to represent the centroid of a cluster. In the second phase, the HMMs based k-means clustering is then performed to refine the cluster results as protein families. The experimental results show our model is effective and efficient in clustering protein families.","PeriodicalId":169554,"journal":{"name":"2008 IEEE International Conference on Information Reuse and Integration","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114108189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Incorporating fuzziness into timer-triggers for temporal event handling
Pub Date: 2008-07-13 | DOI: 10.1109/IRI.2008.4583051
Ying Jin, Tejaswitha Bhavsar
Database triggers allow database users to specify integrity constraints and business logic by describing reactions to events. Traditional database triggers handle mutating events such as insert, update, and delete. This paper describes our approach to incorporating timer-triggers to handle temporal events that are generated at a given time or at certain time intervals. We propose a trigger language, named FZ-Trigger, that allows fuzziness in database triggers: fuzzy expressions may appear in the condition part of a trigger with either a mutating event or a temporal event. This paper describes the generation of temporal events, the FZ-Trigger language, and the system implementation. We also present a motivating example that illustrates the use of FZ-Triggers for reacting to temporal events.
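An illustrative sketch in our own naming, not the FZ-Trigger syntax: a timer event periodically evaluates a fuzzy condition over the database state and fires the action once the membership degree passes a threshold.

```python
# Timer-trigger with a fuzzy condition: the condition is a membership degree
# in [0, 1] rather than a crisp boolean.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tank (level REAL)")
db.execute("INSERT INTO tank VALUES (83.0)")

def nearly_full(level, low=70.0, high=90.0):
    """Fuzzy membership: 0 below `low`, 1 above `high`, linear in between."""
    return min(1.0, max(0.0, (level - low) / (high - low)))

def on_timer_event():
    """Trigger body: invoked at each timer tick (the scheduler is elided)."""
    (level,) = db.execute("SELECT level FROM tank").fetchone()
    degree = nearly_full(level)
    if degree >= 0.5:                  # fuzzy condition part of the trigger
        print(f"alert: tank nearly full (degree {degree:.2f})")

on_timer_event()
```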
{"title":"Incorporating fuzziness into timer-triggers for temporal event handling","authors":"Ying Jin, Tejaswitha Bhavsar","doi":"10.1109/IRI.2008.4583051","DOIUrl":"https://doi.org/10.1109/IRI.2008.4583051","url":null,"abstract":"Database triggers allow database users to specify integrity constraints and business logics by describing the reactions to events. Traditional database triggers can handle mutating events such as insert, update, and delete. This paper describes our approach to incorporate timer-triggers to handle temporal events that are generated at a given time or at certain time intervals. We propose a trigger language, named FZ-Trigger, to allow fuzziness in database triggers. FZ-Triggers allow fuzzy expressions in the condition part of a trigger with either a mutating event or a temporal event. This paper describes the generation of temporal events, the language of FZ-Triggers, and the system implementation. We also present a motivating example that illustrates the use of FZ-Trigger in the case of reacting to temporal events.","PeriodicalId":169554,"journal":{"name":"2008 IEEE International Conference on Information Reuse and Integration","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131487533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Workflow instance detection: Toward a knowledge capture methodology for smart oilfields
Pub Date: 2008-07-13 | DOI: 10.1109/IRI.2008.4583058
Fan Sun, V. Prasanna, A. Bakshi, L. Pianelo
A system that captures knowledge from experienced users is of great interest in the oil industry. An important source of such knowledge is application logs that record user activities. However, most log files are sequential records of pre-defined low-level actions, and it is often inconvenient or even impossible for humans to extract useful information from these entries. Moreover, the heterogeneity of log data in syntax and granularity makes it challenging to extract the underlying knowledge from log files. In this paper, we propose a semantically rich workflow model that captures the semantics of user activities in a hierarchical structure. Mapping low-level log entries to semantic-level workflow components enables automatic aggregation of log entries and their high-level representation. We model and analyze two cases from the petroleum engineering domain in detail, and present an algorithm that detects workflow instances from log files. Experimental results show that the detection algorithm is efficient and scalable.
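A hedged sketch of the core idea (the paper's workflow model and detection algorithm are richer): low-level log actions are mapped to semantic steps, and the log is scanned for in-order occurrences of a workflow's step sequence. All names below are hypothetical.

```python
# Map low-level log actions to semantic steps, then detect workflow instances
# as in-order realizations of the step sequence.
LOG = ["open_file", "zoom", "pick_horizon", "save_file", "open_file", "pick_horizon"]
STEP_OF = {"open_file": "load", "pick_horizon": "interpret", "save_file": "persist"}
WORKFLOW = ["load", "interpret", "persist"]   # hypothetical workflow definition

def detect_instances(log, workflow):
    """Yield (start, end) log indices of subsequences realizing the workflow."""
    need, start = 0, None
    for i, action in enumerate(log):
        step = STEP_OF.get(action)        # aggregation: low-level -> semantic
        if step == workflow[need]:
            start = i if need == 0 else start
            need += 1
            if need == len(workflow):
                yield (start, i)
                need, start = 0, None
    # trailing partial matches are discarded

print(list(detect_instances(LOG, WORKFLOW)))   # -> [(0, 3)]
```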
{"title":"Workflow instance detection: Toward a knowledge capture methodology for smart oilfields","authors":"Fan Sun, V. Prasanna, A. Bakshi, L. Pianelo","doi":"10.1109/IRI.2008.4583058","DOIUrl":"https://doi.org/10.1109/IRI.2008.4583058","url":null,"abstract":"A system that captures knowledge from experienced users is of great interest in the oil industry. An important source of knowledge is application logs that record user activities. However, most of the log files are sequential records of pre-defined low level actions. It is often inconvenient or even impossible for humans to view and obtain useful information from these log entries. Also, the heterogeneity of log data in terms of syntax and granularity makes it challenging to extract the underlying knowledge from log files. In this paper, we propose a semantically rich workflow model to capture the semantics of user activities in a hierarchical structure. The mapping from low level log entries to semantic level workflow components enables automatic aggregation of log entries and their high level representation. We model and analyze two cases from the petroleum engineering domain in detail. We also present an algorithm that detects workflow instances from log files. Experimental results show that the detection algorithm is efficient and scalable.","PeriodicalId":169554,"journal":{"name":"2008 IEEE International Conference on Information Reuse and Integration","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129063520","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An empirical study of supervised learning for biological sequence profiling and microarray expression data analysis
Pub Date: 2008-07-13 | DOI: 10.1109/IRI.2008.4583007
Abu H. M. Kamal, Xingquan Zhu, A. Pandya, S. Hsu, Yong Shi
Recent years have seen increasing quantities of high-throughput biological data available for genetic disease profiling, protein structure and function prediction, and new drug and therapy discovery. High-throughput biological experiments output high-volume and/or high-dimensional data, which pose significant challenges for molecular biologists and domain experts to properly and rapidly digest and interpret. In this paper, we provide background knowledge for computer scientists on how supervised learning tools can be applied to biological problems, with a primary focus on two types of tasks: biological sequence profiling and microarray expression data analysis. We employ a set of supervised learning methods to analyze four types of biological data: (1) gene promoter site prediction; (2) splice junction prediction; (3) protein structure prediction; and (4) gene expression data analysis. We argue that although existing studies favor one or two learning methods (such as Support Vector Machines), such conclusions may be biased, mainly because of the inadequacy of the measures employed. A range of learning algorithms should be considered in different scenarios, depending on the objectives and requirements of the application, such as the system running time or the prediction accuracy on minority-class examples.
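The closing argument suggests a simple experiment: rank learners by the measures the application actually cares about rather than a single headline score. The sketch below, on synthetic imbalanced data standing in for the biological benchmarks, reports minority-class recall alongside training time for three learners.

```python
# Compare learners on measures beyond overall accuracy: minority-class recall
# and training time, per the paper's argument about evaluation measures.
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=3000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for model in (SVC(), RandomForestClassifier(random_state=0), GaussianNB()):
    t0 = time.perf_counter()
    model.fit(X_tr, y_tr)
    elapsed = time.perf_counter() - t0
    minority_recall = recall_score(y_te, model.predict(X_te), pos_label=1)
    print(f"{type(model).__name__}: minority recall={minority_recall:.2f}, "
          f"fit time={elapsed:.2f}s")
```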
{"title":"An empirical study of supervised learning for biological sequence profiling and microarray expression data analysis","authors":"Abu H. M. Kamal, Xingquan Zhu, A. Pandya, S. Hsu, Yong Shi","doi":"10.1109/IRI.2008.4583007","DOIUrl":"https://doi.org/10.1109/IRI.2008.4583007","url":null,"abstract":"Recent years have seen increasing quantities of high-throughput biological data available for genetic disease profiling, protein structure and function prediction, and new drug and therapy discovery. High-throughput biological experiments output high volume and/or high dimensional data, which impose significant challenges for molecular biologists and domain experts to properly and rapidly digest and interpret the data. In this paper, we provide simple background knowledge for computer scientists to understand how supervised learning tools can be used to solve biological challenges, with a primary focus on two types of problems: Biological sequence profiling and microarray expression data analysis. We employ a set of supervised learning methods to analyze four types of biological data: (1) gene promoter site prediction; (2) splice junction prediction; (3) protein structure prediction; and (4) gene expression data analysis. We argue that although existing studies favor one or two learning methods (such as Support Vector Machines), such conclusions might have been biased, mainly because of the inadequacy of the measures employed in their study. A line of learning algorithms should be considered in different scenarios, depending on the objective and the requirement of the applications, such as the system running time or the prediction accuracy on the minority class examples.","PeriodicalId":169554,"journal":{"name":"2008 IEEE International Conference on Information Reuse and Integration","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127174493","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An algorithm for activation timed influence nets
Pub Date: 2008-07-13 | DOI: 10.1109/IRI.2008.4583050
P. Papantoni-Kazakos, A. K. Zaidi, M. F. Rafi
An Activation Timed Influence Net (ATIN) represents a progressively evolving sequence of actions in which the effects of one action become the preconditions of the action that follows. An ATIN integrates the notions of time and uncertainty in a network model whose nodes explicitly represent mechanisms and/or tactical actions responsible for changes in the state of a domain. In this paper, we present an algorithm for the initialization of actions within an ATIN.
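The paper's initialization algorithm is not reproduced here; as a plain illustration of the effects-become-preconditions chaining that defines an ATIN (leaving aside the probabilistic machinery), the following sketch assigns each action the earliest start time at which its preconditions hold. The action set is hypothetical.

```python
# Forward chaining: an action may start once every fact it requires has been
# produced by an earlier action's effects.
ACTIONS = {  # hypothetical: name -> (preconditions, effects, duration)
    "secure_area": (set(),           {"area_secure"},  2),
    "deploy":      ({"area_secure"}, {"deployed"},     3),
    "operate":     ({"deployed"},    {"mission_done"}, 1),
}

def initialize_times(actions):
    """Earliest-start initialization by forward chaining over effects."""
    available = {}          # fact -> time it becomes true
    start = {}
    pending = dict(actions)
    while pending:
        for name, (pre, eff, dur) in list(pending.items()):
            if pre <= set(available):
                t = max((available[f] for f in pre), default=0)
                start[name] = t
                for f in eff:
                    available[f] = t + dur
                del pending[name]
                break
        else:
            raise ValueError("unsatisfiable preconditions")
    return start

print(initialize_times(ACTIONS))   # -> {'secure_area': 0, 'deploy': 2, 'operate': 5}
```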
{"title":"An algorithm for activation timed influence nets","authors":"P. Papantoni-Kazakos, A. K. Zaidi, M. F. Rafi","doi":"10.1109/IRI.2008.4583050","DOIUrl":"https://doi.org/10.1109/IRI.2008.4583050","url":null,"abstract":"Activation Timed Influence Net (ATIN) is a term representing a progressively evolving sequence of actions, where the effects of an action become the preconditions of the action that follows. An ATIN integrates the notions of time and uncertainty in a network model, where nodes explicitly represent mechanisms and/or tactical actions that are responsible for changes in the state of a domain. In this paper, we present an algorithm for the initialization of actions within a ATIN.","PeriodicalId":169554,"journal":{"name":"2008 IEEE International Conference on Information Reuse and Integration","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126461067","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gadget creation for personal information integration on web portals
Pub Date: 2008-07-13 | DOI: 10.1109/IRI.2008.4583076
Chia-Hui Chang, Shih-Feng Yang, Che-Min Liou, Mohammed Kayed
Although the ever-growing Web contains information relevant to virtually every user's query, that does not guarantee effective access to it: in many situations, users still have to do a lot of browsing to fuse the information they need. In this paper, we propose gadget creation, by which extracted data can be immediately reused on personal portals through existing presentation components such as maps, calendars, tables, and lists. The underlying technique is FivaTech, an unsupervised web data extraction approach that wraps data (usually in XML format). Despite efforts to apply supervised web data extraction to RSS feed burning, such as OpenKapow and Dapper, there has been no research on incorporating unsupervised extraction methods into RSS feed or gadget creation. The created gadgets can be used immediately and embedded in any web site, especially Web portals (personal desktops on the Web). This paper describes our initiative toward a personal information integration service where light-weight software can be created without programming.
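A minimal sketch of the reuse step, with fabricated records standing in for a wrapper's XML output: extracted data is re-published as an RSS feed that a portal gadget can render, so the end user writes no code.

```python
# Re-publish wrapper-extracted records as an RSS 2.0 feed for portal gadgets.
import xml.etree.ElementTree as ET

records = [  # stand-in for extracted wrapper output; values are invented
    {"title": "Laptop X1", "link": "http://example.com/x1", "price": "999"},
    {"title": "Laptop X2", "link": "http://example.com/x2", "price": "799"},
]

rss = ET.Element("rss", version="2.0")
channel = ET.SubElement(rss, "channel")
ET.SubElement(channel, "title").text = "Extracted deals"
for rec in records:
    item = ET.SubElement(channel, "item")
    ET.SubElement(item, "title").text = f'{rec["title"]} ({rec["price"]})'
    ET.SubElement(item, "link").text = rec["link"]

print(ET.tostring(rss, encoding="unicode"))
```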
{"title":"Gadget creation for personal information integration on web portals","authors":"Chia-Hui Chang, Shih-Feng Yang, Che-Min Liou, Mohammed Kayed","doi":"10.1109/IRI.2008.4583076","DOIUrl":"https://doi.org/10.1109/IRI.2008.4583076","url":null,"abstract":"Although the ever growing Web contain information to virtually every user’s query, it does not guarantee effectively accessing to those information. In many situations, the users still have to do a lot of browsing in order to fuse the information needed. In this paper, we propose the idea of gadget creation such that extracted data can be immediately reused on personal portals by existing presentation components, like map, calendar, table and lists, etc. The underlying technique is an unsupervised web data extraction approach, FivaTech, which has been proposed to wrap data (usually in xml format). Despite the efforts to utilize supervised web data extraction in RSS feed burning like OpenKapow and Dapper, there’s no research on incorporating unsupervised extraction method for RSS feeds or gadget creation. The advanced application in gadget creation allow immediate use by users and can be embedded to any web sites, especially Web portals (personal desktop on Web). This paper describes our initiatives in working towards a personal information integration service where light-weight software can be created without programming.","PeriodicalId":169554,"journal":{"name":"2008 IEEE International Conference on Information Reuse and Integration","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125958018","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}