Identifying learners robust to low quality data
Pub Date: 2008-07-13 | DOI: 10.1109/IRI.2008.4583028
A. Folleco, T. Khoshgoftaar, J. V. Hulse, Amri Napolitano
Real-world datasets commonly contain noise distributed in both the independent and dependent variables. Noise, which typically consists of erroneous variable values, has been shown to significantly degrade the classification performance of learners. In this study, we identify learners with robust performance in the presence of low-quality (noisy) measurement data. Noise was injected into five class-imbalanced software engineering measurement datasets that were initially relatively free of noise. The experimental factors considered included the learner used, the level of injected noise, the dataset used (each with unique properties), and the percentage of minority instances containing noise. We found no other studies that identify learners robust to low-quality measurement data. Based on the results of this study, we recommend the random forest learner for building classification models from noisy data.
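The experimental design lends itself to a compact illustration. Below is a minimal sketch, assuming a synthetic imbalanced dataset in place of the software engineering data: labels of a controlled fraction of minority-class instances are corrupted, and two learners are compared as the noise level rises. This is our reconstruction for illustration, not the authors' exact protocol.

```python
# A minimal sketch of the noise-injection experiment, not the authors' exact
# protocol: corrupt the labels of a fraction of minority instances and compare
# how a random forest and a single decision tree degrade.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Synthetic stand-in for a class-imbalanced measurement dataset (10% minority).
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

def inject_minority_label_noise(y, level, rng):
    """Flip the labels of a `level` fraction of minority instances to the majority class."""
    y_noisy = y.copy()
    minority = np.flatnonzero(y == 1)
    flipped = rng.choice(minority, size=int(level * len(minority)), replace=False)
    y_noisy[flipped] = 0
    return y_noisy

for level in (0.0, 0.1, 0.3):
    y_noisy = inject_minority_label_noise(y, level, rng)
    for model in (RandomForestClassifier(random_state=0),
                  DecisionTreeClassifier(random_state=0)):
        auc = cross_val_score(model, X, y_noisy, scoring="roc_auc", cv=5).mean()
        print(f"noise={level:.0%} {type(model).__name__}: AUC={auc:.3f}")
```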
{"title":"Identifying learners robust to low quality data","authors":"A. Folleco, T. Khoshgoftaar, J. V. Hulse, Amri Napolitano","doi":"10.1109/IRI.2008.4583028","DOIUrl":"https://doi.org/10.1109/IRI.2008.4583028","url":null,"abstract":"Real world datasets commonly contain noise that is distributed in both the independent and dependent variables. Noise, which typically consists of erroneous variable values, has been shown to significantly affect the classification performance of learners. In this study, we identify learners with robust performance in the presence of low quality (noisy) measurement data. Noise was injected into five class imbalanced software engineering measurement datasets, initially relatively free of noise. The experimental factors considered included the learner used, the level of injected noise, the dataset used (each with unique properties), and the percentage of minority instances containing noise. No other related studies were found that have identified learners that are robust in the presence of low quality measurement data. Based on the results of this study, we recommend using the random forest learner for building classification models from noisy data.","PeriodicalId":169554,"journal":{"name":"2008 IEEE International Conference on Information Reuse and Integration","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125207083","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Using a search engine to query a relational database
Pub Date: 2008-07-13 | DOI: 10.1109/IRI.2008.4582997
Brian Harrington, R. Brazile, K. Swigger
While search engines are the most popular way to find information on the web, they are generally not used to query relational databases (RDBs). This paper describes a technique for making the data in an RDB accessible to standard search engines. The technique involves using a URL to express a query and creating a wrapper that processes the URL-query and generates web pages containing the answer to the query as well as links to additional data. By following these links, a crawler is able to index the RDB along with all the URL-queries. Once the content and the corresponding URL-queries have been indexed, a user may submit keyword queries through a standard search engine and receive up-to-date database information. We tested whether the system could return results similar to those of equivalent SQL queries, and whether a standard search engine such as Google could actually index the database content appropriately.
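A minimal sketch of the wrapper idea, under our own naming (the paper does not publish code): each URL encodes a query, and the wrapper answers with an HTML page whose links are themselves URL-queries, so an ordinary crawler can walk the database by following links.

```python
# Illustrative wrapper: translate a URL-query into SQL, render the rows, and
# emit further URL-queries as links for the crawler to follow.
import sqlite3
from urllib.parse import parse_qs, urlparse

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)")
db.executemany("INSERT INTO employee VALUES (?, ?, ?)",
               [(1, "Ada", "RD"), (2, "Grace", "RD"), (3, "Alan", "QA")])

def handle_url_query(url):
    """Answer a URL-query like /employee?dept=QA with a crawlable HTML page."""
    parts = urlparse(url)
    table = parts.path.strip("/")        # a real wrapper would whitelist names
    filters = {k: v[0] for k, v in parse_qs(parts.query).items()}
    where = " AND ".join(f"{k} = ?" for k in filters) or "1=1"
    rows = db.execute(f"SELECT id, name, dept FROM {table} WHERE {where}",
                      list(filters.values())).fetchall()
    items = "".join(
        f'<li>{name} <a href="/{table}?dept={dept}">others in {dept}</a></li>'
        for _, name, dept in rows)
    return f"<html><body><ul>{items}</ul></body></html>"

print(handle_url_query("/employee?dept=QA"))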
{"title":"Using a search engine to query a relational database","authors":"Brian Harrington, R. Brazile, K. Swigger","doi":"10.1109/IRI.2008.4582997","DOIUrl":"https://doi.org/10.1109/IRI.2008.4582997","url":null,"abstract":"While search engines are the most popular way to find information on the web, they are generally not used to query relational databases (RDBs). This paper describes a technique for making the data in an RDB accessible to standard search engines. The technique involves using a URL to express queries and creating a wrapper that can then process the URL-query and generate web pages that contain the answer to the query as well as links to additional data. By following these links, a crawler is able to index the RDB along with all the URL-queries. Once the content and their corresponding URL-queries have been indexed, a user may submit keyword queries through a standard search engine and receive up-to-date database information. The system was then tested to determine if it could return results that were similar to those submitted using SQL. We also looked at whether a standard search engine such as Google could actually index the database content appropriately.","PeriodicalId":169554,"journal":{"name":"2008 IEEE International Conference on Information Reuse and Integration","volume":"96 6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134057919","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data warehouse architecture and design
Pub Date: 2008-07-13 | DOI: 10.1109/IRI.2008.4583005
Mohammad Rifaie, K. Kianmehr, R. Alhajj, M. Ridley
A data warehouse is attractive as the main repository of an organization's historical data and is optimized for reporting and analysis. In this paper, we present the process of data warehouse architecture development and design. We highlight the different aspects to be considered in building a data warehouse, ranging from data store characteristics to data modeling and the principles to be followed for an effective data warehouse architecture.
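As one concrete illustration of the data modeling the paper discusses, here is a standard star schema (a textbook example, not a design taken from the paper): a central fact table keyed to descriptive dimension tables, optimized for the aggregate queries that reporting requires.

```python
# Textbook star schema: a fact table of measures joined to dimension tables.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE fact_sales  (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    units       INTEGER,
    revenue     REAL
);
""")

# A typical reporting query: aggregate the fact table, slice by dimensions.
query = """
SELECT d.year, p.category, SUM(f.revenue)
FROM fact_sales f
JOIN dim_date d    ON f.date_key = d.date_key
JOIN dim_product p ON f.product_key = p.product_key
GROUP BY d.year, p.category
"""
print(db.execute(query).fetchall())   # empty until the warehouse is loaded
```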
{"title":"Data warehouse architecture and design","authors":"Mohammad Rifaie, K. Kianmehr, R. Alhajj, M. Ridley","doi":"10.1109/IRI.2008.4583005","DOIUrl":"https://doi.org/10.1109/IRI.2008.4583005","url":null,"abstract":"A data warehouse is attractive as the main repository of an organization’s historical data and is optimized for reporting and analysis. In this paper, we present a data warehouse the process of data warehouse architecture development and design. We highlight the different aspects to be considered in building a data warehouse. These range from data store characteristics to data modeling and the principles to be considered for effective data warehouse architecture.","PeriodicalId":169554,"journal":{"name":"2008 IEEE International Conference on Information Reuse and Integration","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131273512","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Authoritative documents identification based on Nonnegative Matrix Factorization
Pub Date: 2008-07-13 | DOI: 10.1109/IRI.2008.4583040
N. F. Chikhi, B. Rothenburger, Nathalie Aussenac-Gilles
Current techniques for authoritative document identification (ADI) suffer from two main drawbacks. On the one hand, the results of several ADI algorithms cannot be interpreted in a straightforward manner; this symptom is observed, for instance, in the HITS family of algorithms. On the other hand, the accuracy of some ADI algorithms is poor; for instance, PHITS overcomes the interpretability issue of HITS at the price of low accuracy. In this paper, we propose a new ADI algorithm, NHITS, which experimentally outperforms both HITS and PHITS in terms of interpretability and accuracy.
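For context, here is a compact implementation of the HITS baseline that the paper improves on; the NMF-based NHITS algorithm itself is specified in the paper and not reproduced here.

```python
# HITS by power iteration: authority and hub scores reinforce each other.
import numpy as np

def hits(adjacency, iterations=50):
    """Return (authority, hub) scores for a directed graph given as a 0/1
    matrix A where A[i, j] = 1 means page i links to page j."""
    n = adjacency.shape[0]
    authority = np.ones(n)
    hub = np.ones(n)
    for _ in range(iterations):
        authority = adjacency.T @ hub   # good authorities are cited by good hubs
        hub = adjacency @ authority     # good hubs cite good authorities
        authority /= np.linalg.norm(authority)
        hub /= np.linalg.norm(hub)
    return authority, hub

A = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)
print(hits(A))
```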
{"title":"Authoritative documents identification based on Nonnegative Matrix Factorization","authors":"N. F. Chikhi, B. Rothenburger, Nathalie Aussenac-Gilles","doi":"10.1109/IRI.2008.4583040","DOIUrl":"https://doi.org/10.1109/IRI.2008.4583040","url":null,"abstract":"Current techniques for authoritative documents identification (ADI) suffer two main drawbacks. On the one hand, results of several ADI algorithms cannot be interpreted in a straightforward manner. This symptom is observed for instance in the HITS family algorithms. On the other hand, accuracy of some ADI algorithms is poor. For instance, PHITS overcomes the interpretability issue of HITS at the price of a low accuracy. In this paper, we propose a new ADI algorithm, namely NHITS, which experimentally outperforms both HITS and PHITS in terms of interpretability and accuracy.","PeriodicalId":169554,"journal":{"name":"2008 IEEE International Conference on Information Reuse and Integration","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133333699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An unsupervised protein sequences clustering algorithm using functional domain information
Pub Date: 2008-07-13 | DOI: 10.1109/IRI.2008.4583008
Wei-bang Chen, Chengcui Zhang, Hua Zhong
In this paper, we present a novel unsupervised approach for clustering protein sequences that incorporates functional domain information into the clustering process. In the proposed framework, the domain boundaries predicted by the ProDom database provide a better measure for calculating sequence similarity. The clustering kernel is a two-phase algorithm: hierarchical clustering in the first phase pre-clusters the protein sequences, and partitioning clustering in the second phase refines the results. More specifically, we first perform agglomerative hierarchical clustering on the protein sequences to obtain initial clusters, and then build a profile Hidden Markov Model (HMM) for each cluster to represent its centroid. In the second phase, HMM-based k-means clustering refines the clusters into protein families. Experimental results show that our model is effective and efficient in clustering protein families.
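A simplified sketch of the two-phase scheme, with deliberate substitutions: k-mer count vectors stand in for the paper's ProDom-informed similarity, and plain mean centroids stand in for the profile HMMs, which would require domain databases and HMM tooling in practice.

```python
# Phase 1: agglomerative pre-clustering; Phase 2: k-means refinement seeded
# with the phase-1 centroids. Features are simple k-mer counts.
from itertools import product
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans

def kmer_vector(seq, k=2, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Represent a protein sequence by its k-mer counts."""
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    v = np.zeros(len(index))
    for i in range(len(seq) - k + 1):
        if seq[i:i + k] in index:
            v[index[seq[i:i + k]]] += 1
    return v

seqs = ["MKTAYIAKQR", "MKTAYLAKQR", "GGHVVEGLAG", "GGHVVEGLTG"]  # toy input
X = np.array([kmer_vector(s) for s in seqs])

# Phase 1: hierarchical clustering yields the initial partition (labels 1..2).
labels0 = fcluster(linkage(X, method="average"), t=2, criterion="maxclust")
# Phase 2: k-means refines the partition, seeded with the phase-1 centroids.
seeds = np.array([X[labels0 == c].mean(axis=0) for c in (1, 2)])
print(KMeans(n_clusters=2, init=seeds, n_init=1).fit_predict(X))
```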
{"title":"An unsupervised protein sequences clustering algorithm using functional domain information","authors":"Wei-bang Chen, Chengcui Zhang, Hua Zhong","doi":"10.1109/IRI.2008.4583008","DOIUrl":"https://doi.org/10.1109/IRI.2008.4583008","url":null,"abstract":"In this paper, we present an unsupervised novel approach for protein sequences clustering by incorporating the functional domain information into the clustering process. In the proposed framework, the domain boundaries predicated by ProDom database are used to provide a better measurement in calculating the sequence similarity. In addition, we use an unsupervised clustering algorithm as the kernel that includes a hierarchical clustering in the first phase to pre-cluster the protein sequences, and a partitioning clustering in the second phase to refine the clustering results. More specifically, we perform the agglomerative hierarchical clustering on protein sequences in the first phase to obtain the initial clustering results for the subsequent partitioning clustering, and then, a profile Hidden Markove Model (HMM) is built for each cluster to represent the centroid of a cluster. In the second phase, the HMMs based k-means clustering is then performed to refine the cluster results as protein families. The experimental results show our model is effective and efficient in clustering protein families.","PeriodicalId":169554,"journal":{"name":"2008 IEEE International Conference on Information Reuse and Integration","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114108189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Incorporating fuzziness into timer-triggers for temporal event handling
Pub Date: 2008-07-13 | DOI: 10.1109/IRI.2008.4583051
Ying Jin, Tejaswitha Bhavsar
Database triggers allow database users to specify integrity constraints and business logic by describing reactions to events. Traditional database triggers handle mutating events such as insert, update, and delete. This paper describes our approach to incorporating timer-triggers to handle temporal events that are generated at a given time or at certain time intervals. We propose a trigger language, named FZ-Trigger, that allows fuzziness in database triggers: fuzzy expressions may appear in the condition part of a trigger with either a mutating event or a temporal event. This paper describes the generation of temporal events, the FZ-Trigger language, and the system implementation. We also present a motivating example that illustrates the use of FZ-Triggers for reacting to temporal events.
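An illustrative sketch in our own naming, not the FZ-Trigger syntax: a timer event periodically evaluates a fuzzy condition over the database state and fires the action once the membership degree passes a threshold.

```python
# Timer-trigger with a fuzzy condition: the condition is a membership degree
# in [0, 1] rather than a crisp boolean.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tank (level REAL)")
db.execute("INSERT INTO tank VALUES (83.0)")

def nearly_full(level, low=70.0, high=90.0):
    """Fuzzy membership: 0 below `low`, 1 above `high`, linear in between."""
    return min(1.0, max(0.0, (level - low) / (high - low)))

def on_timer_event():
    """Trigger body: invoked at each timer tick (the scheduler is elided)."""
    (level,) = db.execute("SELECT level FROM tank").fetchone()
    degree = nearly_full(level)
    if degree >= 0.5:                  # fuzzy condition part of the trigger
        print(f"alert: tank nearly full (degree {degree:.2f})")

on_timer_event()
```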
{"title":"Incorporating fuzziness into timer-triggers for temporal event handling","authors":"Ying Jin, Tejaswitha Bhavsar","doi":"10.1109/IRI.2008.4583051","DOIUrl":"https://doi.org/10.1109/IRI.2008.4583051","url":null,"abstract":"Database triggers allow database users to specify integrity constraints and business logics by describing the reactions to events. Traditional database triggers can handle mutating events such as insert, update, and delete. This paper describes our approach to incorporate timer-triggers to handle temporal events that are generated at a given time or at certain time intervals. We propose a trigger language, named FZ-Trigger, to allow fuzziness in database triggers. FZ-Triggers allow fuzzy expressions in the condition part of a trigger with either a mutating event or a temporal event. This paper describes the generation of temporal events, the language of FZ-Triggers, and the system implementation. We also present a motivating example that illustrates the use of FZ-Trigger in the case of reacting to temporal events.","PeriodicalId":169554,"journal":{"name":"2008 IEEE International Conference on Information Reuse and Integration","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131487533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Workflow instance detection: Toward a knowledge capture methodology for smart oilfields
Pub Date: 2008-07-13 | DOI: 10.1109/IRI.2008.4583058
Fan Sun, V. Prasanna, A. Bakshi, L. Pianelo
A system that captures knowledge from experienced users is of great interest in the oil industry. An important source of such knowledge is application logs that record user activities. However, most log files are sequential records of pre-defined low-level actions, and it is often inconvenient or even impossible for humans to extract useful information from these entries. Moreover, the heterogeneity of log data in syntax and granularity makes it challenging to extract the underlying knowledge from log files. In this paper, we propose a semantically rich workflow model that captures the semantics of user activities in a hierarchical structure. Mapping low-level log entries to semantic-level workflow components enables automatic aggregation of log entries and their high-level representation. We model and analyze two cases from the petroleum engineering domain in detail, and present an algorithm that detects workflow instances from log files. Experimental results show that the detection algorithm is efficient and scalable.
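A hedged sketch of the core idea (the paper's workflow model and detection algorithm are richer): low-level log actions are mapped to semantic steps, and the log is scanned for in-order occurrences of a workflow's step sequence. All names below are hypothetical.

```python
# Map low-level log actions to semantic steps, then detect workflow instances
# as in-order realizations of the step sequence.
LOG = ["open_file", "zoom", "pick_horizon", "save_file", "open_file", "pick_horizon"]
STEP_OF = {"open_file": "load", "pick_horizon": "interpret", "save_file": "persist"}
WORKFLOW = ["load", "interpret", "persist"]   # hypothetical workflow definition

def detect_instances(log, workflow):
    """Yield (start, end) log indices of subsequences realizing the workflow."""
    need, start = 0, None
    for i, action in enumerate(log):
        step = STEP_OF.get(action)        # aggregation: low-level -> semantic
        if step == workflow[need]:
            start = i if need == 0 else start
            need += 1
            if need == len(workflow):
                yield (start, i)
                need, start = 0, None
    # trailing partial matches are discarded

print(list(detect_instances(LOG, WORKFLOW)))   # -> [(0, 3)]
```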
{"title":"Workflow instance detection: Toward a knowledge capture methodology for smart oilfields","authors":"Fan Sun, V. Prasanna, A. Bakshi, L. Pianelo","doi":"10.1109/IRI.2008.4583058","DOIUrl":"https://doi.org/10.1109/IRI.2008.4583058","url":null,"abstract":"A system that captures knowledge from experienced users is of great interest in the oil industry. An important source of knowledge is application logs that record user activities. However, most of the log files are sequential records of pre-defined low level actions. It is often inconvenient or even impossible for humans to view and obtain useful information from these log entries. Also, the heterogeneity of log data in terms of syntax and granularity makes it challenging to extract the underlying knowledge from log files. In this paper, we propose a semantically rich workflow model to capture the semantics of user activities in a hierarchical structure. The mapping from low level log entries to semantic level workflow components enables automatic aggregation of log entries and their high level representation. We model and analyze two cases from the petroleum engineering domain in detail. We also present an algorithm that detects workflow instances from log files. Experimental results show that the detection algorithm is efficient and scalable.","PeriodicalId":169554,"journal":{"name":"2008 IEEE International Conference on Information Reuse and Integration","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129063520","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An empirical study of supervised learning for biological sequence profiling and microarray expression data analysis
Pub Date: 2008-07-13 | DOI: 10.1109/IRI.2008.4583007
Abu H. M. Kamal, Xingquan Zhu, A. Pandya, S. Hsu, Yong Shi
Recent years have seen increasing quantities of high-throughput biological data available for genetic disease profiling, protein structure and function prediction, and new drug and therapy discovery. High-throughput biological experiments output high-volume and/or high-dimensional data, which pose significant challenges for molecular biologists and domain experts to properly and rapidly digest and interpret. In this paper, we provide background knowledge for computer scientists on how supervised learning tools can be applied to biological problems, with a primary focus on two types of tasks: biological sequence profiling and microarray expression data analysis. We employ a set of supervised learning methods to analyze four types of biological data: (1) gene promoter site prediction; (2) splice junction prediction; (3) protein structure prediction; and (4) gene expression data analysis. We argue that although existing studies favor one or two learning methods (such as Support Vector Machines), such conclusions may be biased, mainly because of the inadequacy of the measures employed. A range of learning algorithms should be considered in different scenarios, depending on the objectives and requirements of the application, such as the system running time or the prediction accuracy on minority-class examples.
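The closing argument suggests a simple experiment: rank learners by the measures the application actually cares about rather than a single headline score. The sketch below, on synthetic imbalanced data standing in for the biological benchmarks, reports minority-class recall alongside training time for three learners.

```python
# Compare learners on measures beyond overall accuracy: minority-class recall
# and training time, per the paper's argument about evaluation measures.
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=3000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for model in (SVC(), RandomForestClassifier(random_state=0), GaussianNB()):
    t0 = time.perf_counter()
    model.fit(X_tr, y_tr)
    elapsed = time.perf_counter() - t0
    minority_recall = recall_score(y_te, model.predict(X_te), pos_label=1)
    print(f"{type(model).__name__}: minority recall={minority_recall:.2f}, "
          f"fit time={elapsed:.2f}s")
```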
{"title":"An empirical study of supervised learning for biological sequence profiling and microarray expression data analysis","authors":"Abu H. M. Kamal, Xingquan Zhu, A. Pandya, S. Hsu, Yong Shi","doi":"10.1109/IRI.2008.4583007","DOIUrl":"https://doi.org/10.1109/IRI.2008.4583007","url":null,"abstract":"Recent years have seen increasing quantities of high-throughput biological data available for genetic disease profiling, protein structure and function prediction, and new drug and therapy discovery. High-throughput biological experiments output high volume and/or high dimensional data, which impose significant challenges for molecular biologists and domain experts to properly and rapidly digest and interpret the data. In this paper, we provide simple background knowledge for computer scientists to understand how supervised learning tools can be used to solve biological challenges, with a primary focus on two types of problems: Biological sequence profiling and microarray expression data analysis. We employ a set of supervised learning methods to analyze four types of biological data: (1) gene promoter site prediction; (2) splice junction prediction; (3) protein structure prediction; and (4) gene expression data analysis. We argue that although existing studies favor one or two learning methods (such as Support Vector Machines), such conclusions might have been biased, mainly because of the inadequacy of the measures employed in their study. A line of learning algorithms should be considered in different scenarios, depending on the objective and the requirement of the applications, such as the system running time or the prediction accuracy on the minority class examples.","PeriodicalId":169554,"journal":{"name":"2008 IEEE International Conference on Information Reuse and Integration","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127174493","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An algorithm for activation timed influence nets
Pub Date: 2008-07-13 | DOI: 10.1109/IRI.2008.4583050
P. Papantoni-Kazakos, A. K. Zaidi, M. F. Rafi
An Activation Timed Influence Net (ATIN) represents a progressively evolving sequence of actions in which the effects of one action become the preconditions of the action that follows. An ATIN integrates the notions of time and uncertainty in a network model whose nodes explicitly represent mechanisms and/or tactical actions responsible for changes in the state of a domain. In this paper, we present an algorithm for the initialization of actions within an ATIN.
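The paper's initialization algorithm is not reproduced here; as a plain illustration of the effects-become-preconditions chaining that defines an ATIN (leaving aside the probabilistic machinery), the following sketch assigns each action the earliest start time at which its preconditions hold. The action set is hypothetical.

```python
# Forward chaining: an action may start once every fact it requires has been
# produced by an earlier action's effects.
ACTIONS = {  # hypothetical: name -> (preconditions, effects, duration)
    "secure_area": (set(),           {"area_secure"},  2),
    "deploy":      ({"area_secure"}, {"deployed"},     3),
    "operate":     ({"deployed"},    {"mission_done"}, 1),
}

def initialize_times(actions):
    """Earliest-start initialization by forward chaining over effects."""
    available = {}          # fact -> time it becomes true
    start = {}
    pending = dict(actions)
    while pending:
        for name, (pre, eff, dur) in list(pending.items()):
            if pre <= set(available):
                t = max((available[f] for f in pre), default=0)
                start[name] = t
                for f in eff:
                    available[f] = t + dur
                del pending[name]
                break
        else:
            raise ValueError("unsatisfiable preconditions")
    return start

print(initialize_times(ACTIONS))   # -> {'secure_area': 0, 'deploy': 2, 'operate': 5}
```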
{"title":"An algorithm for activation timed influence nets","authors":"P. Papantoni-Kazakos, A. K. Zaidi, M. F. Rafi","doi":"10.1109/IRI.2008.4583050","DOIUrl":"https://doi.org/10.1109/IRI.2008.4583050","url":null,"abstract":"Activation Timed Influence Net (ATIN) is a term representing a progressively evolving sequence of actions, where the effects of an action become the preconditions of the action that follows. An ATIN integrates the notions of time and uncertainty in a network model, where nodes explicitly represent mechanisms and/or tactical actions that are responsible for changes in the state of a domain. In this paper, we present an algorithm for the initialization of actions within a ATIN.","PeriodicalId":169554,"journal":{"name":"2008 IEEE International Conference on Information Reuse and Integration","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126461067","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gadget creation for personal information integration on web portals
Pub Date: 2008-07-13 | DOI: 10.1109/IRI.2008.4583076
Chia-Hui Chang, Shih-Feng Yang, Che-Min Liou, Mohammed Kayed
Although the ever-growing Web contains information relevant to virtually every user's query, that does not guarantee effective access to it: in many situations, users still have to do a lot of browsing to fuse the information they need. In this paper, we propose gadget creation, by which extracted data can be immediately reused on personal portals through existing presentation components such as maps, calendars, tables, and lists. The underlying technique is FivaTech, an unsupervised web data extraction approach that wraps data (usually in XML format). Despite efforts to apply supervised web data extraction to RSS feed burning, such as OpenKapow and Dapper, there has been no research on incorporating unsupervised extraction methods into RSS feed or gadget creation. The created gadgets can be used immediately and embedded in any web site, especially Web portals (personal desktops on the Web). This paper describes our initiative toward a personal information integration service where light-weight software can be created without programming.
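A minimal sketch of the reuse step, with fabricated records standing in for a wrapper's XML output: extracted data is re-published as an RSS feed that a portal gadget can render, so the end user writes no code.

```python
# Re-publish wrapper-extracted records as an RSS 2.0 feed for portal gadgets.
import xml.etree.ElementTree as ET

records = [  # stand-in for extracted wrapper output; values are invented
    {"title": "Laptop X1", "link": "http://example.com/x1", "price": "999"},
    {"title": "Laptop X2", "link": "http://example.com/x2", "price": "799"},
]

rss = ET.Element("rss", version="2.0")
channel = ET.SubElement(rss, "channel")
ET.SubElement(channel, "title").text = "Extracted deals"
for rec in records:
    item = ET.SubElement(channel, "item")
    ET.SubElement(item, "title").text = f'{rec["title"]} ({rec["price"]})'
    ET.SubElement(item, "link").text = rec["link"]

print(ET.tostring(rss, encoding="unicode"))
```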
{"title":"Gadget creation for personal information integration on web portals","authors":"Chia-Hui Chang, Shih-Feng Yang, Che-Min Liou, Mohammed Kayed","doi":"10.1109/IRI.2008.4583076","DOIUrl":"https://doi.org/10.1109/IRI.2008.4583076","url":null,"abstract":"Although the ever growing Web contain information to virtually every user’s query, it does not guarantee effectively accessing to those information. In many situations, the users still have to do a lot of browsing in order to fuse the information needed. In this paper, we propose the idea of gadget creation such that extracted data can be immediately reused on personal portals by existing presentation components, like map, calendar, table and lists, etc. The underlying technique is an unsupervised web data extraction approach, FivaTech, which has been proposed to wrap data (usually in xml format). Despite the efforts to utilize supervised web data extraction in RSS feed burning like OpenKapow and Dapper, there’s no research on incorporating unsupervised extraction method for RSS feeds or gadget creation. The advanced application in gadget creation allow immediate use by users and can be embedded to any web sites, especially Web portals (personal desktop on Web). This paper describes our initiatives in working towards a personal information integration service where light-weight software can be created without programming.","PeriodicalId":169554,"journal":{"name":"2008 IEEE International Conference on Information Reuse and Integration","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125958018","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}