A Neural Database for Answering Aggregate Queries on Incomplete Relational Data (Extended Abstract)
Pub Date: 2024-05-01 | Epub Date: 2024-07-23 | DOI: 10.1109/icde60146.2024.00483
Sepanta Zeighami, Raghav Seshadri, Cyrus Shahabi
Proceedings. International Conference on Data Engineering, 2024, pp. 5703-5704.
Wearables for Health (W4H) Toolkit for Acquisition, Storage, Analysis and Visualization of Data from Various Wearable Devices
Pub Date: 2024-05-01 | Epub Date: 2024-07-23 | DOI: 10.1109/ICDE60146.2024.00419
Arash Hajisafi, Maria Despoina Siampou, Jize Bi, Luciano Nocera, Cyrus Shahabi
The Wearables for Health Toolkit (W4H Toolkit) is an open-source platform that provides a robust, end-to-end solution for the centralized management and analysis of wearable data. With integrated tools and frameworks, the toolkit facilitates seamless data acquisition, integration, storage, analysis, and visualization of both stored and streaming data from various wearable devices. The W4H Toolkit is designed to provide medical researchers and health practitioners with a unified framework that enables the analysis of health-related data for various clinical applications. We provide an overview of the system and demonstrate how health researchers can use it to import and analyze a wide range of wearable data, highlighting the versatility and functionality of the system across diverse healthcare domains and applications.
Proceedings. International Conference on Data Engineering, 2024, pp. 5425-5428.
A Mortality Study for ICU Patients using Bursty Medical Events
Pub Date: 2017-04-01 | Epub Date: 2017-05-18 | DOI: 10.1109/ICDE.2017.224
Luca Bonomi, Xiaoqian Jiang
The study of patients in Intensive Care Units (ICUs) is a crucial task in critical care research, with significant implications both for identifying clinical risk factors and for defining institutional guidelines. The mortality of ICU patients is of particular interest because it provides useful indications to healthcare institutions for improving patient experience, internal policies, and procedures (e.g., allocation of resources). To this end, much research has focused on the length of stay (LOS) of ICU patients as a feature for studying mortality. In this work, we propose a novel mortality study based on the notion of burstiness, which takes the temporal information in patients' longitudinal data into consideration. The burstiness of temporal data is a popular measure in network analysis and time-series anomaly detection, where high values of burstiness indicate the presence of rapidly occurring events in short time periods (i.e., bursts). Our intuition is that these bursts may relate to possible complications in the patient's medical condition and hence provide indications of mortality. Compared to the LOS, the burstiness parameter captures the temporality of medical events, providing information about the overall dynamics of the patient's condition. To the best of our knowledge, we are the first to apply the burstiness measure in the clinical research domain. Our preliminary results on a real dataset show that patients with high burstiness values tend to have a higher mortality rate than patients with more regular medical events. Overall, our study shows promising results and provides useful insights for developing predictive models on temporal data and advancing modern critical care medicine.
Proceedings. International Conference on Data Engineering, 2017, pp. 1533-1540.
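The abstract above does not define the burstiness measure it uses. A common formulation in the time-series literature, and one plausible reading of the statistic (an assumption, not necessarily the paper's exact definition), is B = (σ − μ) / (σ + μ) computed over inter-event times, where B near +1 indicates bursty event sequences and B near −1 indicates perfectly regular ones. A minimal sketch with hypothetical event timestamps:

```python
import numpy as np

def burstiness(event_times):
    """Burstiness parameter B = (sigma - mu) / (sigma + mu) of inter-event times.

    B ~ +1: highly bursty (events cluster in short periods),
    B ~  0: Poisson-like,
    B ~ -1: perfectly regular events.
    """
    times = np.sort(np.asarray(event_times, dtype=float))
    gaps = np.diff(times)                      # inter-event intervals
    mu, sigma = gaps.mean(), gaps.std()
    if mu + sigma == 0:                        # degenerate case: all gaps are zero
        return 0.0
    return (sigma - mu) / (sigma + mu)

# Hypothetical timestamps (in hours) of medical events for two ICU patients.
regular_patient = [0, 6, 12, 18, 24, 30, 36]           # evenly spaced events
bursty_patient  = [0, 0.5, 0.8, 1.0, 30, 30.2, 30.4]   # two short bursts far apart

print(burstiness(regular_patient))   # -1.0 (perfectly regular)
print(burstiness(bursty_patient))    # positive, closer to +1
```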
A Scalable Data Integration and Analysis Architecture for Sensor Data of Pediatric Asthma
Pub Date: 2017-04-01 | Epub Date: 2017-05-18 | DOI: 10.1109/ICDE.2017.198
Dimitris Stripelis, José Luis Ambite, Yao-Yi Chiang, Sandrah P Eckel, Rima Habre
According to the Centers for Disease Control and Prevention, 6.8 million children in the United States are living with asthma. Despite the importance of the disease, the available prognostic tools are not sufficient for biomedical researchers to thoroughly investigate the potential risks of the disease at scale. To overcome these challenges, we present a big data integration and analysis infrastructure developed by our Data and Software Coordination and Integration Center (DSCIC) of the NIBIB-funded Pediatric Research using Integrated Sensor Monitoring Systems (PRISMS) program. Our goal is to help biomedical researchers efficiently predict and prevent asthma attacks. The PRISMS-DSCIC is responsible for collecting, integrating, storing, and analyzing real-time environmental, physiological, and behavioral data obtained from heterogeneous sensor and traditional data sources. Our architecture is based on the Apache Kafka, Spark, and Hadoop frameworks and the PostgreSQL DBMS. A main contribution of this work is extending the Spark framework with a mediation layer, based on logical schema mappings and query rewriting, to facilitate data analysis over a consistent harmonized schema. The system provides both batch and stream analytic capabilities over the massive data generated by wearable and fixed sensors.
Proceedings. International Conference on Data Engineering, 2017, pp. 1407-1408.
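The mediation-layer idea described above (logical schema mappings plus query rewriting over Spark) can be illustrated with a small PySpark sketch. The vendor feeds, column names, and unit conversions below are hypothetical, and this is not the DSCIC's actual implementation; it only shows how heterogeneous sensor schemas could be rewritten into one harmonized view that downstream queries target.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("prisms-mediation-sketch").getOrCreate()

# Hypothetical raw feeds: two sensor vendors expose different column names and units.
vendor_a = spark.createDataFrame(
    [("p01", "2017-04-01 10:00:00", 21.0)],
    ["subject", "ts", "pm25_ugm3"])
vendor_b = spark.createDataFrame(
    [("p02", 1491040800, 0.019)],
    ["patient_id", "epoch_s", "pm25_mgm3"])

# Logical schema mappings: rewrite each source into one harmonized schema
# (participant_id, event_time, pm25) so analyses are written once, against the view.
harmonized_a = vendor_a.select(
    F.col("subject").alias("participant_id"),
    F.to_timestamp("ts").alias("event_time"),
    F.col("pm25_ugm3").alias("pm25"))
harmonized_b = vendor_b.select(
    F.col("patient_id").alias("participant_id"),
    F.to_timestamp(F.from_unixtime("epoch_s")).alias("event_time"),
    (F.col("pm25_mgm3") * 1000).alias("pm25"))   # convert mg/m3 -> ug/m3

harmonized = harmonized_a.unionByName(harmonized_b)
harmonized.createOrReplaceTempView("pm25_events")

# Downstream queries target the mediated view, not the vendor-specific schemas.
spark.sql("SELECT participant_id, avg(pm25) AS mean_pm25 "
          "FROM pm25_events GROUP BY participant_id").show()
```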
Integrated Theory- and Data-driven Feature Selection in Gene Expression Data Analysis
Pub Date: 2017-04-01 | Epub Date: 2017-05-18 | DOI: 10.1109/ICDE.2017.223
Vineet K Raghu, Xiaoyu Ge, Panos K Chrysanthis, Panayiotis V Benos
The exponential growth of high-dimensional biological data has led to a rapid increase in demand for automated approaches to knowledge production. Existing methods rely on two general approaches to address this challenge: 1) the Theory-driven approach, which utilizes prior accumulated knowledge, and 2) the Data-driven approach, which relies solely on the data to deduce scientific knowledge. Each of these approaches alone suffers from bias toward past/present knowledge, as neither incorporates all of the knowledge currently available for making new discoveries. In this paper, we show how an integrated method can effectively address the high dimensionality of big biological data, which is a major problem for purely data-driven analysis approaches. We realize our approach in a novel two-step analytical workflow that incorporates a new feature selection paradigm as the first step, to handle high-throughput gene expression data, and graphical causal modeling as the second step, to automatically extract causal relationships. Our results on real-world clinical datasets from The Cancer Genome Atlas (TCGA) demonstrate that our method is capable of intelligently selecting genes for learning effective causal networks.
Proceedings. International Conference on Data Engineering, 2017, pp. 1525-1532.
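As a rough illustration of the two-step workflow described above, the sketch below blends a theory-driven prior score with a data-driven statistic (mutual information) to rank genes before any causal modeling. The gene names, prior scores, and weighting scheme are hypothetical and are not the paper's formulation; the causal-structure-learning step is only indicated in a comment.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Hypothetical inputs: an expression matrix X (samples x genes), phenotype labels y,
# and a prior-knowledge score per gene (e.g., from pathway databases), all made up here.
rng = np.random.default_rng(0)
genes = ["TP53", "BRCA1", "EGFR", "GAPDH", "ACTB"]
X = rng.normal(size=(100, len(genes)))
y = rng.integers(0, 2, size=100)
prior_score = {"TP53": 0.9, "BRCA1": 0.8, "EGFR": 0.7, "GAPDH": 0.1, "ACTB": 0.1}

# Data-driven evidence: mutual information between each gene and the phenotype.
mi = mutual_info_classif(X, y, random_state=0)

# Integrated score: blend theory-driven and data-driven evidence.
# The equal weighting here is illustrative only.
alpha = 0.5
score = {g: alpha * prior_score[g] + (1 - alpha) * mi[i] / (mi.max() + 1e-12)
         for i, g in enumerate(genes)}
selected = sorted(score, key=score.get, reverse=True)[:3]
print(selected)

# Step 2 (not shown): the selected genes would be passed to a graphical causal
# structure-learning algorithm to extract directed relationships among them.
```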
Secure Skyline Queries on Cloud Platform
Pub Date: 2017-04-01 | Epub Date: 2017-05-18 | DOI: 10.1109/ICDE.2017.117
Jinfei Liu, Juncheng Yang, Li Xiong, Jian Pei
Outsourcing data and computation to a cloud server provides a cost-effective way to support large-scale data storage and query processing. However, due to security and privacy concerns, sensitive data (e.g., medical records) need to be protected from the cloud server and other unauthorized users. One approach is to outsource encrypted data to the cloud server and have the cloud server perform query processing on the encrypted data only. It remains a challenging task to support various queries over encrypted data in a secure and efficient way, such that the cloud server gains no knowledge about the data, the query, or the query result. In this paper, we study the problem of secure skyline queries over encrypted data. The skyline query is particularly important for multi-criteria decision making but also presents significant challenges due to its complex computation. We propose a fully secure skyline query protocol on data encrypted using semantically secure encryption. As a key subroutine, we present a new secure dominance protocol, which can also be used as a building block for other queries. Finally, we provide both serial and parallelized implementations and empirically study the protocols in terms of efficiency and scalability under different parameter settings, verifying the feasibility of our proposed solutions.
Proceedings. International Conference on Data Engineering, 2017, pp. 633-644.
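For readers unfamiliar with skyline semantics, the sketch below shows the plaintext dominance test and a naive skyline computation. The paper's contribution is performing this computation securely over encrypted data via a secure dominance protocol, which this plaintext sketch does not attempt to reproduce; the example points are hypothetical.

```python
def dominates(p, q):
    """p dominates q if p is no worse in every dimension and strictly better in at
    least one (assuming smaller values are preferred)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    """Naive plaintext skyline: keep the points not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points if q is not p)]

# Hypothetical query: patients described by (distance from target age, distance from target BP).
points = [(2, 8), (3, 3), (7, 2), (5, 5), (9, 1)]
print(skyline(points))   # [(2, 8), (3, 3), (7, 2), (9, 1)] -- (5, 5) is dominated by (3, 3)
```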
Quantifying Differential Privacy under Temporal Correlations
Pub Date: 2017-04-01 | Epub Date: 2017-05-18 | DOI: 10.1109/ICDE.2017.132
Yang Cao, Masatoshi Yoshikawa, Yonghui Xiao, Li Xiong
Differential Privacy (DP) has received increasing attention as a rigorous privacy framework. Many existing studies employ traditional DP mechanisms (e.g., the Laplace mechanism) as primitives, which assume that the data are independent or that adversaries have no knowledge of the data correlations. However, continuously generated data in the real world tend to be temporally correlated, and such correlations can be acquired by adversaries. In this paper, we investigate the potential privacy loss of a traditional DP mechanism under temporal correlations in the context of continuous data release. First, we model the temporal correlations using a Markov model and analyze the privacy leakage of a DP mechanism when adversaries have knowledge of such temporal correlations. Our analysis reveals that the privacy loss of a DP mechanism may accumulate and increase over time; we call this temporal privacy leakage. Second, to measure such privacy loss, we design an efficient algorithm for calculating it in polynomial time. Although the temporal privacy leakage may increase over time, we also show that its supremum may exist in some cases. Third, to bound the privacy loss, we propose mechanisms that convert any existing DP mechanism into one that protects against temporal privacy leakage. Experiments with synthetic data confirm that our approach is efficient and effective.
Proceedings. International Conference on Data Engineering, 2017, pp. 821-832.
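The sketch below illustrates only the primitive the paper builds on: a Laplace mechanism applied to a sequence of counts, together with the naive sequential-composition bound on accumulated privacy loss. The paper's actual contribution, quantifying the (potentially larger) temporal privacy leakage under a Markov model of correlations, is not reproduced here; the counts and epsilon are hypothetical.

```python
import numpy as np

def laplace_count(true_count, epsilon, rng):
    """epsilon-DP noisy count: Laplace noise with scale 1/epsilon (sensitivity 1)."""
    return true_count + rng.laplace(scale=1.0 / epsilon)

rng = np.random.default_rng(0)
eps_per_release = 0.1
true_counts = [120, 118, 125, 130, 127]          # hypothetical counts at t = 1..5

releases = [laplace_count(c, eps_per_release, rng) for c in true_counts]
print([round(r, 1) for r in releases])

# With independent data, sequential composition bounds the total privacy loss by T * epsilon.
# Under temporal correlations, the effective (temporal) privacy leakage at a given time can
# exceed the per-release epsilon and accumulate, which is what the paper quantifies.
naive_bound = len(true_counts) * eps_per_release
print("naive composition bound:", naive_bound)
```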
CRD: Fast Co-clustering on Large Datasets Utilizing Sampling-Based Matrix Decomposition
Pub Date: 2008-04-25
Feng Pan, Xiang Zhang, Wei Wang
The problem of simultaneously clustering columns and rows (co-clustering) arises in important applications such as text data mining, microarray analysis, and recommendation system analysis. Compared with classical clustering algorithms, co-clustering algorithms have been shown to be more effective in discovering hidden clustering structures in the data matrix. The complexity of previous co-clustering algorithms is usually O(m × n), where m and n are the numbers of rows and columns in the data matrix, respectively. This limits their applicability to data matrices involving a large number of columns and rows. Moreover, some huge datasets cannot be held entirely in main memory during co-clustering, which violates the assumption made by previous algorithms. In this paper, we propose CRD, a general framework for fast co-clustering of large datasets. By utilizing recently developed sampling-based matrix decomposition methods, CRD achieves an execution time linear in m and n. In addition, CRD does not require the whole data matrix to reside in main memory. We conducted extensive experiments on both real and synthetic data. Compared with previous co-clustering algorithms, CRD achieves competitive accuracy at much lower computational cost.
Proceedings. International Conference on Data Engineering, 2008, pp. 1337-1339.
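A toy sketch of the general sampling idea behind the abstract above: cluster all rows using only a small random sample of columns, and all columns using only a small random sample of rows, so no step operates on the full m × n matrix at once. This is an illustration of sampling-based co-clustering under assumed parameters, not the CRD algorithm itself.

```python
import numpy as np
from sklearn.cluster import KMeans

def sampled_coclustering(A, k_rows, k_cols, n_col_samples, n_row_samples, seed=0):
    """Toy sampling-based co-clustering sketch.

    Rows are clustered using only a random sample of columns, and columns are
    clustered using only a random sample of rows, so each k-means runs on a
    matrix whose size is linear in m (or n) rather than m x n.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    col_idx = rng.choice(n, size=n_col_samples, replace=False)
    row_idx = rng.choice(m, size=n_row_samples, replace=False)

    row_labels = KMeans(n_clusters=k_rows, n_init=10, random_state=seed).fit_predict(A[:, col_idx])
    col_labels = KMeans(n_clusters=k_cols, n_init=10, random_state=seed).fit_predict(A[row_idx, :].T)
    return row_labels, col_labels

# Hypothetical block-structured matrix with two row clusters and two column clusters.
rng = np.random.default_rng(1)
A = rng.normal(size=(200, 100)) + np.block([[np.full((100, 50), 3.0), np.zeros((100, 50))],
                                            [np.zeros((100, 50)), np.full((100, 50), 3.0)]])
rows, cols = sampled_coclustering(A, k_rows=2, k_cols=2, n_col_samples=10, n_row_samples=10)
print(rows[:5], cols[:5])
```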
CARE: Finding Local Linear Correlations in High Dimensional Data
Pub Date: 2008-04-25 | DOI: 10.1109/ICDE.2008.4497421
Xiang Zhang, Feng Pan, Wei Wang
Finding latent patterns in high-dimensional data is an important research problem with numerous applications. Existing approaches can be summarized into three categories: feature selection, feature transformation (or feature projection), and projected clustering. Widely used in many applications, these methods aim to capture global patterns and are typically performed in the full feature space. In many emerging biomedical applications, however, scientists are interested in the local latent patterns held by feature subsets, which may be invisible via any global transformation. In this paper, we investigate the problem of finding local linear correlations in high-dimensional data. Our goal is to find the latent pattern structures that may exist only in some subspaces. We formalize this problem as finding strongly correlated feature subsets that are supported by a large portion of the data points. Due to the combinatorial nature of the problem and the lack of monotonicity of the correlation measure, it is prohibitively expensive to exhaustively explore the whole search space. In our algorithm, CARE, we utilize spectrum properties and an effective heuristic to prune the search space. Extensive experimental results show that our approach is effective in finding local linear correlations that may not be identified by existing methods.
Proceedings. International Conference on Data Engineering, 2008, pp. 130-139.
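The "spectrum properties" mentioned above can be illustrated with eigenvalues: a feature subset exhibits a strong local linear correlation when the smallest eigenvalue of its correlation matrix is close to zero (an almost exact linear dependency among the features). The sketch below scores candidate subsets this way by brute force on synthetic data; CARE's pruning heuristics and its point-support criterion are not reproduced.

```python
import numpy as np
from itertools import combinations

def linear_dependency_score(X, subset):
    """Smallest eigenvalue of the correlation matrix of the chosen feature subset.

    A value near zero means the features in `subset` are (almost) linearly
    dependent, i.e., they exhibit a strong local linear correlation.
    """
    corr = np.corrcoef(X[:, subset], rowvar=False)
    return np.linalg.eigvalsh(corr)[0]          # eigvalsh returns eigenvalues in ascending order

# Hypothetical data: f2 is (noisily) a linear combination of f0 and f1; the rest are independent.
rng = np.random.default_rng(0)
n = 500
f0, f1 = rng.normal(size=n), rng.normal(size=n)
f2 = 2 * f0 - f1 + 0.01 * rng.normal(size=n)
X = np.column_stack([f0, f1, f2, rng.normal(size=n), rng.normal(size=n)])

# Brute-force scan over 3-feature subsets (CARE instead prunes this search space).
scores = {s: linear_dependency_score(X, list(s)) for s in combinations(range(X.shape[1]), 3)}
best = min(scores, key=scores.get)
print(best, round(scores[best], 4))   # expected: (0, 1, 2) with a score near 0
```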