A Neural Database for Answering Aggregate Queries on Incomplete Relational Data (Extended Abstract)
Pub Date: 2024-05-01 | Epub Date: 2024-07-23 | DOI: 10.1109/icde60146.2024.00483
Sepanta Zeighami, Raghav Seshadri, Cyrus Shahabi
Proceedings. International Conference on Data Engineering, 2024, pp. 5703-5704.
Wearables for Health (W4H) Toolkit for Acquisition, Storage, Analysis and Visualization of Data from Various Wearable Devices
Pub Date: 2024-05-01 | Epub Date: 2024-07-23 | DOI: 10.1109/ICDE60146.2024.00419
Arash Hajisafi, Maria Despoina Siampou, Jize Bi, Luciano Nocera, Cyrus Shahabi
The Wearables for Health Toolkit (W4H Toolkit) is an open-source platform that provides a robust, end-to-end solution for the centralized management and analysis of wearable data. With integrated tools and frameworks, the toolkit facilitates seamless data acquisition, integration, storage, analysis, and visualization of both stored and streaming data from various wearable devices. The W4H Toolkit is designed to provide medical researchers and health practitioners with a unified framework that enables the analysis of health-related data for various clinical applications. We provide an overview of the system and demonstrate how health researchers can use it to import and analyze a wide range of wearable data, highlighting the versatility and functionality of the system across diverse healthcare domains and applications.
Proceedings. International Conference on Data Engineering, 2024, pp. 5425-5428.
A Mortality Study for ICU Patients using Bursty Medical Events
Pub Date: 2017-04-01 | Epub Date: 2017-05-18 | DOI: 10.1109/ICDE.2017.224
Luca Bonomi, Xiaoqian Jiang
The study of patients in Intensive Care Units (ICUs) is a crucial task in critical care research, with significant implications both for identifying clinical risk factors and for defining institutional guidelines. The mortality of ICU patients is of particular interest because it provides useful indications to healthcare institutions for improving patient experience, internal policies, and procedures (e.g., allocation of resources). To this end, much research has focused on the length of stay (LOS) of ICU patients as a feature for studying mortality. In this work, we propose a novel mortality study based on the notion of burstiness, which takes the temporal information in patients' longitudinal data into consideration. The burstiness of temporal data is a popular measure in network analysis and time-series anomaly detection, where high values of burstiness indicate the presence of rapidly occurring events in short time periods (i.e., bursts). Our intuition is that these bursts may relate to possible complications in the patient's medical condition and hence provide indications of mortality. Compared to the LOS, the burstiness parameter captures the temporality of medical events, providing information about the overall dynamics of the patient's condition. To the best of our knowledge, we are the first to apply the burstiness measure in the clinical research domain. Our preliminary results on a real dataset show that patients with high burstiness values tend to have a higher mortality rate than patients with more regular medical events. Overall, our study shows promising results and provides useful insights for developing predictive models on temporal data and advancing modern critical care medicine.
Proceedings. International Conference on Data Engineering, 2017, pp. 1533-1540.
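The abstract above does not define the burstiness measure it uses. A common formulation in the time-series literature, and one plausible reading of the statistic (an assumption, not necessarily the paper's exact definition), is B = (σ − μ) / (σ + μ) computed over inter-event times, where B near +1 indicates bursty event sequences and B near −1 indicates perfectly regular ones. A minimal sketch with hypothetical event timestamps:

```python
import numpy as np

def burstiness(event_times):
    """Burstiness parameter B = (sigma - mu) / (sigma + mu) of inter-event times.

    B ~ +1: highly bursty (events cluster in short periods),
    B ~  0: Poisson-like,
    B ~ -1: perfectly regular events.
    """
    times = np.sort(np.asarray(event_times, dtype=float))
    gaps = np.diff(times)                      # inter-event intervals
    mu, sigma = gaps.mean(), gaps.std()
    if mu + sigma == 0:                        # degenerate case: all gaps are zero
        return 0.0
    return (sigma - mu) / (sigma + mu)

# Hypothetical timestamps (in hours) of medical events for two ICU patients.
regular_patient = [0, 6, 12, 18, 24, 30, 36]           # evenly spaced events
bursty_patient  = [0, 0.5, 0.8, 1.0, 30, 30.2, 30.4]   # two short bursts far apart

print(burstiness(regular_patient))   # -1.0 (perfectly regular)
print(burstiness(bursty_patient))    # positive, closer to +1
```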
A Scalable Data Integration and Analysis Architecture for Sensor Data of Pediatric Asthma
Pub Date: 2017-04-01 | Epub Date: 2017-05-18 | DOI: 10.1109/ICDE.2017.198
Dimitris Stripelis, José Luis Ambite, Yao-Yi Chiang, Sandrah P Eckel, Rima Habre
According to the Centers for Disease Control and Prevention, 6.8 million children in the United States are living with asthma. Despite the importance of the disease, the available prognostic tools are not sufficient for biomedical researchers to thoroughly investigate the potential risks of the disease at scale. To overcome these challenges, we present a big data integration and analysis infrastructure developed by our Data and Software Coordination and Integration Center (DSCIC) of the NIBIB-funded Pediatric Research using Integrated Sensor Monitoring Systems (PRISMS) program. Our goal is to help biomedical researchers efficiently predict and prevent asthma attacks. The PRISMS-DSCIC is responsible for collecting, integrating, storing, and analyzing real-time environmental, physiological, and behavioral data obtained from heterogeneous sensor and traditional data sources. Our architecture is based on the Apache Kafka, Spark, and Hadoop frameworks and the PostgreSQL DBMS. A main contribution of this work is extending the Spark framework with a mediation layer, based on logical schema mappings and query rewriting, to facilitate data analysis over a consistent harmonized schema. The system provides both batch and stream analytic capabilities over the massive data generated by wearable and fixed sensors.
Proceedings. International Conference on Data Engineering, 2017, pp. 1407-1408.
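The mediation-layer idea described above (logical schema mappings plus query rewriting over Spark) can be illustrated with a small PySpark sketch. The vendor feeds, column names, and unit conversions below are hypothetical, and this is not the DSCIC's actual implementation; it only shows how heterogeneous sensor schemas could be rewritten into one harmonized view that downstream queries target.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("prisms-mediation-sketch").getOrCreate()

# Hypothetical raw feeds: two sensor vendors expose different column names and units.
vendor_a = spark.createDataFrame(
    [("p01", "2017-04-01 10:00:00", 21.0)],
    ["subject", "ts", "pm25_ugm3"])
vendor_b = spark.createDataFrame(
    [("p02", 1491040800, 0.019)],
    ["patient_id", "epoch_s", "pm25_mgm3"])

# Logical schema mappings: rewrite each source into one harmonized schema
# (participant_id, event_time, pm25) so analyses are written once, against the view.
harmonized_a = vendor_a.select(
    F.col("subject").alias("participant_id"),
    F.to_timestamp("ts").alias("event_time"),
    F.col("pm25_ugm3").alias("pm25"))
harmonized_b = vendor_b.select(
    F.col("patient_id").alias("participant_id"),
    F.to_timestamp(F.from_unixtime("epoch_s")).alias("event_time"),
    (F.col("pm25_mgm3") * 1000).alias("pm25"))   # convert mg/m3 -> ug/m3

harmonized = harmonized_a.unionByName(harmonized_b)
harmonized.createOrReplaceTempView("pm25_events")

# Downstream queries target the mediated view, not the vendor-specific schemas.
spark.sql("SELECT participant_id, avg(pm25) AS mean_pm25 "
          "FROM pm25_events GROUP BY participant_id").show()
```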
Integrated Theory- and Data-driven Feature Selection in Gene Expression Data Analysis
Pub Date: 2017-04-01 | Epub Date: 2017-05-18 | DOI: 10.1109/ICDE.2017.223
Vineet K Raghu, Xiaoyu Ge, Panos K Chrysanthis, Panayiotis V Benos
The exponential growth of high-dimensional biological data has led to a rapid increase in demand for automated approaches to knowledge production. Existing methods rely on two general approaches to address this challenge: 1) the Theory-driven approach, which utilizes prior accumulated knowledge, and 2) the Data-driven approach, which relies solely on the data to deduce scientific knowledge. Each of these approaches alone suffers from bias toward past/present knowledge, as neither incorporates all of the knowledge currently available for making new discoveries. In this paper, we show how an integrated method can effectively address the high dimensionality of big biological data, which is a major problem for purely data-driven analysis approaches. We realize our approach in a novel two-step analytical workflow that incorporates a new feature selection paradigm as the first step, to handle high-throughput gene expression data, and graphical causal modeling as the second step, to automatically extract causal relationships. Our results on real-world clinical datasets from The Cancer Genome Atlas (TCGA) demonstrate that our method is capable of intelligently selecting genes for learning effective causal networks.
Proceedings. International Conference on Data Engineering, 2017, pp. 1525-1532.
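As a rough illustration of the two-step workflow described above, the sketch below blends a theory-driven prior score with a data-driven statistic (mutual information) to rank genes before any causal modeling. The gene names, prior scores, and weighting scheme are hypothetical and are not the paper's formulation; the causal-structure-learning step is only indicated in a comment.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Hypothetical inputs: an expression matrix X (samples x genes), phenotype labels y,
# and a prior-knowledge score per gene (e.g., from pathway databases), all made up here.
rng = np.random.default_rng(0)
genes = ["TP53", "BRCA1", "EGFR", "GAPDH", "ACTB"]
X = rng.normal(size=(100, len(genes)))
y = rng.integers(0, 2, size=100)
prior_score = {"TP53": 0.9, "BRCA1": 0.8, "EGFR": 0.7, "GAPDH": 0.1, "ACTB": 0.1}

# Data-driven evidence: mutual information between each gene and the phenotype.
mi = mutual_info_classif(X, y, random_state=0)

# Integrated score: blend theory-driven and data-driven evidence.
# The equal weighting here is illustrative only.
alpha = 0.5
score = {g: alpha * prior_score[g] + (1 - alpha) * mi[i] / (mi.max() + 1e-12)
         for i, g in enumerate(genes)}
selected = sorted(score, key=score.get, reverse=True)[:3]
print(selected)

# Step 2 (not shown): the selected genes would be passed to a graphical causal
# structure-learning algorithm to extract directed relationships among them.
```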
Secure Skyline Queries on Cloud Platform
Pub Date: 2017-04-01 | Epub Date: 2017-05-18 | DOI: 10.1109/ICDE.2017.117
Jinfei Liu, Juncheng Yang, Li Xiong, Jian Pei
Outsourcing data and computation to a cloud server provides a cost-effective way to support large-scale data storage and query processing. However, due to security and privacy concerns, sensitive data (e.g., medical records) need to be protected from the cloud server and other unauthorized users. One approach is to outsource encrypted data to the cloud server and have the cloud server perform query processing on the encrypted data only. It remains a challenging task to support various queries over encrypted data in a secure and efficient way, such that the cloud server gains no knowledge about the data, the query, or the query result. In this paper, we study the problem of secure skyline queries over encrypted data. The skyline query is particularly important for multi-criteria decision making but also presents significant challenges due to its complex computation. We propose a fully secure skyline query protocol on data encrypted using semantically secure encryption. As a key subroutine, we present a new secure dominance protocol, which can also be used as a building block for other queries. Finally, we provide both serial and parallelized implementations and empirically study the protocols in terms of efficiency and scalability under different parameter settings, verifying the feasibility of our proposed solutions.
Proceedings. International Conference on Data Engineering, 2017, pp. 633-644.
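For readers unfamiliar with skyline semantics, the sketch below shows the plaintext dominance test and a naive skyline computation. The paper's contribution is performing this computation securely over encrypted data via a secure dominance protocol, which this plaintext sketch does not attempt to reproduce; the example points are hypothetical.

```python
def dominates(p, q):
    """p dominates q if p is no worse in every dimension and strictly better in at
    least one (assuming smaller values are preferred)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    """Naive plaintext skyline: keep the points not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points if q is not p)]

# Hypothetical query: patients described by (distance from target age, distance from target BP).
points = [(2, 8), (3, 3), (7, 2), (5, 5), (9, 1)]
print(skyline(points))   # [(2, 8), (3, 3), (7, 2), (9, 1)] -- (5, 5) is dominated by (3, 3)
```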
Quantifying Differential Privacy under Temporal Correlations
Pub Date: 2017-04-01 | Epub Date: 2017-05-18 | DOI: 10.1109/ICDE.2017.132
Yang Cao, Masatoshi Yoshikawa, Yonghui Xiao, Li Xiong
Differential Privacy (DP) has received increasing attention as a rigorous privacy framework. Many existing studies employ traditional DP mechanisms (e.g., the Laplace mechanism) as primitives, which assume that the data are independent or that adversaries have no knowledge of the data correlations. However, continuously generated data in the real world tend to be temporally correlated, and such correlations can be acquired by adversaries. In this paper, we investigate the potential privacy loss of a traditional DP mechanism under temporal correlations in the context of continuous data release. First, we model the temporal correlations using a Markov model and analyze the privacy leakage of a DP mechanism when adversaries have knowledge of such temporal correlations. Our analysis reveals that the privacy loss of a DP mechanism may accumulate and increase over time; we call this temporal privacy leakage. Second, to measure such privacy loss, we design an efficient algorithm for calculating it in polynomial time. Although the temporal privacy leakage may increase over time, we also show that its supremum may exist in some cases. Third, to bound the privacy loss, we propose mechanisms that convert any existing DP mechanism into one that protects against temporal privacy leakage. Experiments with synthetic data confirm that our approach is efficient and effective.
Proceedings. International Conference on Data Engineering, 2017, pp. 821-832.
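The sketch below illustrates only the primitive the paper builds on: a Laplace mechanism applied to a sequence of counts, together with the naive sequential-composition bound on accumulated privacy loss. The paper's actual contribution, quantifying the (potentially larger) temporal privacy leakage under a Markov model of correlations, is not reproduced here; the counts and epsilon are hypothetical.

```python
import numpy as np

def laplace_count(true_count, epsilon, rng):
    """epsilon-DP noisy count: Laplace noise with scale 1/epsilon (sensitivity 1)."""
    return true_count + rng.laplace(scale=1.0 / epsilon)

rng = np.random.default_rng(0)
eps_per_release = 0.1
true_counts = [120, 118, 125, 130, 127]          # hypothetical counts at t = 1..5

releases = [laplace_count(c, eps_per_release, rng) for c in true_counts]
print([round(r, 1) for r in releases])

# With independent data, sequential composition bounds the total privacy loss by T * epsilon.
# Under temporal correlations, the effective (temporal) privacy leakage at a given time can
# exceed the per-release epsilon and accumulate, which is what the paper quantifies.
naive_bound = len(true_counts) * eps_per_release
print("naive composition bound:", naive_bound)
```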
CRD: Fast Co-clustering on Large Datasets Utilizing Sampling-Based Matrix Decomposition
Pub Date: 2008-04-25
Feng Pan, Xiang Zhang, Wei Wang
The problem of simultaneously clustering columns and rows (co-clustering) arises in important applications such as text data mining, microarray analysis, and recommendation system analysis. Compared with classical clustering algorithms, co-clustering algorithms have been shown to be more effective in discovering hidden clustering structures in the data matrix. The complexity of previous co-clustering algorithms is usually O(m × n), where m and n are the numbers of rows and columns in the data matrix, respectively. This limits their applicability to data matrices involving a large number of columns and rows. Moreover, some huge datasets cannot be held entirely in main memory during co-clustering, which violates the assumption made by previous algorithms. In this paper, we propose CRD, a general framework for fast co-clustering of large datasets. By utilizing recently developed sampling-based matrix decomposition methods, CRD achieves an execution time linear in m and n. In addition, CRD does not require the whole data matrix to reside in main memory. We conducted extensive experiments on both real and synthetic data. Compared with previous co-clustering algorithms, CRD achieves competitive accuracy at much lower computational cost.
Proceedings. International Conference on Data Engineering, 2008, pp. 1337-1339.
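A toy sketch of the general sampling idea behind the abstract above: cluster all rows using only a small random sample of columns, and all columns using only a small random sample of rows, so no step operates on the full m × n matrix at once. This is an illustration of sampling-based co-clustering under assumed parameters, not the CRD algorithm itself.

```python
import numpy as np
from sklearn.cluster import KMeans

def sampled_coclustering(A, k_rows, k_cols, n_col_samples, n_row_samples, seed=0):
    """Toy sampling-based co-clustering sketch.

    Rows are clustered using only a random sample of columns, and columns are
    clustered using only a random sample of rows, so each k-means runs on a
    matrix whose size is linear in m (or n) rather than m x n.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    col_idx = rng.choice(n, size=n_col_samples, replace=False)
    row_idx = rng.choice(m, size=n_row_samples, replace=False)

    row_labels = KMeans(n_clusters=k_rows, n_init=10, random_state=seed).fit_predict(A[:, col_idx])
    col_labels = KMeans(n_clusters=k_cols, n_init=10, random_state=seed).fit_predict(A[row_idx, :].T)
    return row_labels, col_labels

# Hypothetical block-structured matrix with two row clusters and two column clusters.
rng = np.random.default_rng(1)
A = rng.normal(size=(200, 100)) + np.block([[np.full((100, 50), 3.0), np.zeros((100, 50))],
                                            [np.zeros((100, 50)), np.full((100, 50), 3.0)]])
rows, cols = sampled_coclustering(A, k_rows=2, k_cols=2, n_col_samples=10, n_row_samples=10)
print(rows[:5], cols[:5])
```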
CARE: Finding Local Linear Correlations in High Dimensional Data
Pub Date: 2008-04-25 | DOI: 10.1109/ICDE.2008.4497421
Xiang Zhang, Feng Pan, Wei Wang
Finding latent patterns in high-dimensional data is an important research problem with numerous applications. Existing approaches can be summarized into three categories: feature selection, feature transformation (or feature projection), and projected clustering. Widely used in many applications, these methods aim to capture global patterns and are typically performed in the full feature space. In many emerging biomedical applications, however, scientists are interested in the local latent patterns held by feature subsets, which may be invisible via any global transformation. In this paper, we investigate the problem of finding local linear correlations in high-dimensional data. Our goal is to find the latent pattern structures that may exist only in some subspaces. We formalize this problem as finding strongly correlated feature subsets that are supported by a large portion of the data points. Due to the combinatorial nature of the problem and the lack of monotonicity of the correlation measure, it is prohibitively expensive to exhaustively explore the whole search space. In our algorithm, CARE, we utilize spectrum properties and an effective heuristic to prune the search space. Extensive experimental results show that our approach is effective in finding local linear correlations that may not be identified by existing methods.
Proceedings. International Conference on Data Engineering, 2008, pp. 130-139.
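The "spectrum properties" mentioned above can be illustrated with eigenvalues: a feature subset exhibits a strong local linear correlation when the smallest eigenvalue of its correlation matrix is close to zero (an almost exact linear dependency among the features). The sketch below scores candidate subsets this way by brute force on synthetic data; CARE's pruning heuristics and its point-support criterion are not reproduced.

```python
import numpy as np
from itertools import combinations

def linear_dependency_score(X, subset):
    """Smallest eigenvalue of the correlation matrix of the chosen feature subset.

    A value near zero means the features in `subset` are (almost) linearly
    dependent, i.e., they exhibit a strong local linear correlation.
    """
    corr = np.corrcoef(X[:, subset], rowvar=False)
    return np.linalg.eigvalsh(corr)[0]          # eigvalsh returns eigenvalues in ascending order

# Hypothetical data: f2 is (noisily) a linear combination of f0 and f1; the rest are independent.
rng = np.random.default_rng(0)
n = 500
f0, f1 = rng.normal(size=n), rng.normal(size=n)
f2 = 2 * f0 - f1 + 0.01 * rng.normal(size=n)
X = np.column_stack([f0, f1, f2, rng.normal(size=n), rng.normal(size=n)])

# Brute-force scan over 3-feature subsets (CARE instead prunes this search space).
scores = {s: linear_dependency_score(X, list(s)) for s in combinations(range(X.shape[1]), 3)}
best = min(scores, key=scores.get)
print(best, round(scores[best], 4))   # expected: (0, 1, 2) with a score near 0
```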