Association rule mining has received significant attention from both the data mining and machine learning communities. While data mining researchers focus more on designing efficient algorithms to mine rules from large datasets, the learning community has explored applications of rule mining to classification. A major problem with rule mining algorithms is the explosion of rules even for moderate-sized datasets, which makes it very difficult for end users to identify rules that are both statistically significant and potentially novel, and that could lead to interesting new insights and hypotheses. Researchers have proposed many domain-independent interestingness measures with which one can rank the rules and potentially glean useful rules from the top-ranked ones. However, these measures have not been fully explored for rule mining in clinical datasets, owing to the relatively large sizes of the datasets often encountered in healthcare and to limited access to domain experts for review and analysis. In this paper, using an electronic medical record (EMR) dataset of diagnoses and medications from over three million patient visits to the University of Kentucky medical center and affiliated clinics, we conduct a thorough evaluation of dozens of interestingness measures proposed in the data mining literature, including some new composite measures. Using cumulative relevance metrics from information retrieval, we compare these interestingness measures against human judgments obtained from a practicing psychiatrist for association rules involving the depressive disorders class as the consequent. Our results not only surface new interesting associations for depressive disorders but also identify classes of interestingness measures that weight rule novelty and statistical strength in contrasting ways, offering new insights for end users seeking interesting rules.
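To make the ranking idea concrete, here is a minimal sketch of scoring association rules with several standard interestingness measures (confidence, lift, leverage, conviction) and ranking rules by one of them. The rule names and counts are illustrative, not drawn from the paper's EMR dataset, and the specific composite measures evaluated in the paper are not reproduced.

```python
# Rank toy association rules by standard interestingness measures.
def interestingness(n, n_a, n_c, n_ac):
    """n: total transactions; n_a: count(antecedent);
    n_c: count(consequent); n_ac: count(antecedent AND consequent)."""
    p_a, p_c, p_ac = n_a / n, n_c / n, n_ac / n
    confidence = p_ac / p_a
    lift = confidence / p_c                      # > 1: positive association
    leverage = p_ac - p_a * p_c                  # deviation from independence
    conviction = (1 - p_c) / (1 - confidence) if confidence < 1 else float("inf")
    return {"confidence": confidence, "lift": lift,
            "leverage": leverage, "conviction": conviction}

# Hypothetical rules with depressive disorders as the consequent.
rules = {"med_X -> depression": (100000, 800, 5000, 120),
         "dx_Y -> depression": (100000, 3000, 5000, 200)}
for name, counts in sorted(rules.items(),
                           key=lambda kv: interestingness(*kv[1])["lift"],
                           reverse=True):
    print(name, interestingness(*counts))
```

Different measures weight novelty (e.g., leverage) and statistical strength (e.g., confidence) differently, which is exactly the contrast the paper's evaluation examines.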
Patient similarity measurement is an important tool for cohort identification in clinical decision support applications. A reliable similarity metric can be used to derive diagnostic or prognostic information about a target patient from other patients with similar trajectories of healthcare events. However, measuring the similarity of care trajectories is complicated by the irregularly timed measurements inherent in healthcare. To address this challenge, we propose a novel temporal similarity measure for patients based on irregularly measured laboratory test data from the Multiparameter Intelligent Monitoring in Intensive Care database and the pediatric intensive care unit (ICU) database of Children's Healthcare of Atlanta. This similarity measure, adapted from the Smith-Waterman algorithm, identifies patients who share sequentially similar laboratory results separated by time intervals of similar length. We demonstrate the predictive power of our method: patients with higher similarity in their earlier histories are likely to have higher similarity in their later histories. In addition, compared with non-temporal measures, our method is stronger at predicting mortality in ICU patients diagnosed with acute kidney injury and sepsis.
Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Retrieval Models and Rankings - similarity measures; J.3 [Applied Computing]: Life and Medical Sciences - health and medical information systems.
General Terms: Algorithms.
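The flavor of a Smith-Waterman-style local alignment over lab-test sequences can be sketched as follows: the match score rewards both similar lab values and similarly sized time gaps between consecutive measurements. All scoring parameters here are hypothetical; the paper's exact modification is not reproduced.

```python
# Hedged sketch: local alignment of (value, time-gap) sequences.
def temporal_sw(seq_a, seq_b, gap=-1.0, tol=1.0, time_tol=24.0):
    """seq_a, seq_b: lists of (lab_value, hours_since_prev_measurement)."""
    n, m = len(seq_a), len(seq_b)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            (va, ta), (vb, tb) = seq_a[i - 1], seq_b[j - 1]
            # Reward similar values AND similarly spaced measurements.
            match = (2.0 - abs(va - vb) / tol) - abs(ta - tb) / time_tol
            H[i][j] = max(0.0,
                          H[i - 1][j - 1] + match,  # align measurement i with j
                          H[i - 1][j] + gap,        # skip a measurement in a
                          H[i][j - 1] + gap)        # skip a measurement in b
            best = max(best, H[i][j])
    return best  # higher score = more similar local trajectories

a = [(1.2, 0), (1.8, 12), (2.5, 12), (1.9, 24)]  # e.g., a creatinine series
b = [(1.1, 0), (1.7, 10), (2.6, 14)]
print(temporal_sw(a, b))
```

As in classical Smith-Waterman, clamping at zero lets the algorithm find the best-matching local stretch of the two histories rather than forcing a global alignment.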
As computational modeling, simulation, and predictions are becoming integral parts of biomedical pipelines, it behooves us to emphasize the reliability of the computational protocol. For any reported quantity of interest (QOI), one must also compute and report a measure of the uncertainty or error associated with the QOI. This is especially important in molecular modeling, since in most practical applications the inputs to the computational protocol are noisy, incomplete, or low-resolution. Unfortunately, currently available modeling tools do not account for uncertainties and their effect on the final QOIs with sufficient rigor. We have developed a statistical framework that expresses the uncertainty of the QOI as the probability that the reported value deviates from the true value by more than some user-defined threshold. First, we provide a theoretical approach in which this probability can be bounded using Azuma-Hoeffding-like inequalities. Second, we approximate this probability empirically by sampling the space of uncertainties of the input, and we provide applications of our framework to bound the uncertainties of several QOIs commonly used in molecular modeling. Finally, we also present several visualization techniques to effectively and quantitatively visualize the uncertainties in the input, in the final QOIs, and in intermediate states.
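The empirical sampling approach can be illustrated with a short Monte Carlo sketch: perturb the noisy input, recompute the QOI, and estimate the probability that it deviates from the reported value by more than a threshold. The QOI (radius of gyration) and Gaussian noise model below are illustrative assumptions, not the paper's protocol.

```python
import numpy as np

def radius_of_gyration(coords):
    """Illustrative QOI: RMS distance of points from their centroid."""
    c = coords - coords.mean(axis=0)
    return np.sqrt((c ** 2).sum(axis=1).mean())

rng = np.random.default_rng(0)
coords = rng.uniform(-10, 10, size=(500, 3))  # stand-in for atom positions
reported = radius_of_gyration(coords)

sigma, eps, trials = 0.5, 0.1, 2000           # assumed input noise, threshold
deviations = np.array([
    abs(radius_of_gyration(coords + rng.normal(0, sigma, coords.shape))
        - reported)
    for _ in range(trials)])
print("Estimated P(|QOI - reported| > eps):", (deviations > eps).mean())
```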
Association rule mining has been utilized extensively in many areas because of its ability to discover relationships among variables in large databases. However, one main drawback of association rule mining is that it tends to generate a large number of rules and does not guarantee that the rules are meaningful in the real world. Many visualization techniques have been proposed for association rules. These techniques were designed to provide a global overview of all rules so as to identify the most meaningful ones. However, using these visualization techniques to search for specific rules becomes challenging, especially when the volume of rules is extremely large. In this study, we developed an interactive association rule visualization technique, called InterVisAR, specifically designed for effective rule search. We conducted a user study with 24 participants, and the results demonstrated that InterVisAR provides an efficient and accurate visualization solution. We also verified that InterVisAR satisfies a non-factorial property that should be guaranteed in rule search. All participants also expressed a strong preference for InterVisAR, as it provides a more comfortable and pleasing visualization for association rule search.
Motivation: Despite several reported successes in accelerating molecular modeling and simulation tools with programmable GPUs (graphics processing units), the general focus has been on fast computation with small molecules, primarily due to the limited memory size of the GPU. Moreover, the simultaneous use of CPU and GPU cores for a single kernel execution - a necessity for achieving high parallelism - has also not been fully considered.
Results: We present fast computation methods for molecular mechanical (Lennard-Jones and Coulombic) and generalized Born solvation energetics that run on commodity multicore CPUs and manycore GPUs. The key idea is to trade off the accuracy of pairwise, long-range atomistic energetics for higher speed of execution. We present a simple yet efficient CUDA kernel for GPU acceleration that ensures high arithmetic intensity and memory efficiency. Our CUDA kernel uses a cache-friendly, recursive, linear-space octree data structure to handle very large molecular structures with up to several million atoms. Based on this kernel, we present a hybrid method that simultaneously exploits both CPU and GPU cores to provide the best performance for selected parameters of the approximation scheme. Our CUDA kernels achieve more than two orders of magnitude speedup over serial computation for many of the molecular energetics terms, and the hybrid method achieves the best performance across all values of the approximation parameter.
Availability: The source code and binaries are freely available as PMEOPA (Parallel Molecular Energetics using Octree Pairwise Approximation) and can be downloaded from http://cvcweb.ices.utexas.edu/software.
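The accuracy-for-speed trade-off behind the octree approximation can be illustrated with a one-level version of the same idea: compute the near field exactly, and approximate each distant group of atoms by its total charge placed at the group's centroid (a monopole-style far-field term). The paper's recursive, cache-friendly octree and CUDA kernel are not reproduced here; cell size and cutoff are illustrative.

```python
import numpy as np

def coulomb_approx(pos, q, target, cell=4.0, near=8.0):
    """Approximate Coulomb potential at `target` (units omitted)."""
    d = np.linalg.norm(pos - target, axis=1)
    nearby = d < near
    energy = (q[nearby] / np.maximum(d[nearby], 1e-6)).sum()  # exact near field
    far_pos, far_q = pos[~nearby], q[~nearby]
    keys = np.floor(far_pos / cell).astype(int)               # grid-cell ids
    for key in np.unique(keys, axis=0):                       # one far-field
        mask = (keys == key).all(axis=1)                      # term per cell
        center = far_pos[mask].mean(axis=0)
        energy += far_q[mask].sum() / np.linalg.norm(center - target)
    return energy

rng = np.random.default_rng(1)
pos = rng.uniform(0, 50, size=(2000, 3))
q = rng.choice([-1.0, 1.0], size=2000)
print(coulomb_approx(pos, q, target=np.array([25.0, 25.0, 25.0])))
```

Shrinking `cell` or growing `near` increases accuracy at the cost of more pairwise work, which is the approximation parameter the hybrid CPU/GPU scheduling tunes against.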
Developing clinical risk prediction models is one of the main tasks of healthcare data mining. Advanced data collection techniques in the current Big Data era have created an emerging and urgent need for scalable, computer-based data mining methods that can turn data into useful, personalized decision support knowledge in a flexible, cost-effective, and productive way. In a previous study, we developed a tool, called icuARM-II, that can generate personalized clinical risk prediction evidence using a temporal rule mining framework. However, generating the final risk prediction with icuARM-II still relied on human interpretation, which was subjective and often biased. In this study, we propose a new mechanism that improves icuARM-II's rule selection by incorporating causal analysis. The generated risk prediction is quantitatively assessed using calibration statistics. To evaluate the performance of the new rule selection mechanism, we conducted a case study predicting short-term intensive care unit (ICU) mortality from personalized lab testing abnormalities. Our results demonstrate better-calibrated ICU risk prediction with the new causality-based rule selection compared with conventional confidence-only rule selection methods.
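Calibration assessment of a risk prediction can be sketched by binning predicted probabilities into deciles and comparing the mean predicted risk with the observed event rate in each bin (the basis of Hosmer-Lemeshow-style statistics). The data below are simulated; icuARM-II's actual calibration statistic is not reproduced.

```python
import numpy as np

def calibration_table(y_true, y_prob, bins=10):
    """Return (mean predicted risk, observed event rate, n) per risk decile."""
    edges = np.quantile(y_prob, np.linspace(0, 1, bins + 1))
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (y_prob >= lo) & (y_prob <= hi)
        if mask.any():
            rows.append((y_prob[mask].mean(), y_true[mask].mean(), mask.sum()))
    return rows

rng = np.random.default_rng(2)
p = rng.uniform(0.01, 0.6, size=5000)  # simulated mortality risk predictions
y = rng.binomial(1, p)                 # simulated outcomes
for pred, obs, n in calibration_table(y, p):
    print(f"predicted {pred:.3f}  observed {obs:.3f}  n={n}")
```

A well-calibrated model shows predicted and observed rates agreeing across all bins, not just good rank ordering.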
Increased understanding of the transcriptomic patterns underlying head and neck squamous cell carcinoma (HNSCC) can facilitate earlier diagnosis and better treatment outcomes. Integrating knowledge from multiple studies is necessary to identify fundamental, consistent gene expression signatures that distinguish HNSCC patient samples from disease-free samples, particularly for detecting HNSCC at an early pathological stage. This study utilizes feature integration and heterogeneous ensemble modeling techniques to develop robust models for predicting HNSCC disease status in both microarray and RNA-seq datasets. Several alternative models demonstrated good performance, with Matthews correlation coefficient (MCC) and area under the ROC curve (AUC) values exceeding 0.8. These models were also applied to discriminate between early pathological stage HNSCC and normal RNA-seq samples, showing encouraging results. The predictive modeling workflow was integrated into a software tool with a graphical user interface. This tool enables HNSCC researchers to harness frequently observed transcriptomic features and ensembles of previously developed models when investigating new HNSCC gene expression datasets.
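A minimal sketch of a heterogeneous ensemble evaluated with MCC and AUC, the paper's two headline metrics, is shown below using scikit-learn. The choice of base classifiers and the synthetic data are illustrative; the paper's feature-integration workflow is not reproduced.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import matthews_corrcoef, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for tumor vs. disease-free expression profiles.
X, y = make_classification(n_samples=400, n_features=50, n_informative=10,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Heterogeneous ensemble: different model families, soft-voted probabilities.
ensemble = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("rf", RandomForestClassifier(random_state=0))],
    voting="soft")
ensemble.fit(X_tr, y_tr)

prob = ensemble.predict_proba(X_te)[:, 1]
print("MCC:", matthews_corrcoef(y_te, ensemble.predict(X_te)))
print("AUC:", roc_auc_score(y_te, prob))
```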
While numerous RNA-seq data analysis pipelines are available, research has shown that the choice of pipeline influences the results of differentially expressed gene detection and gene expression estimation. Gene expression estimation is a key step in RNA-seq data analysis, since the accuracy of gene expression estimates profoundly affects subsequent analyses. Generally, gene expression estimation involves sequence alignment and quantification, and accurate estimation requires accurate alignment. However, the impact of aligners on gene expression estimation remains unclear. We address this gap by constructing nine pipelines, each pairing one of nine spliced aligners with a single quantifier, and using simulated data to investigate the impact of aligners on gene expression estimation. To evaluate alignment, we introduce three performance metrics: (1) the percentage of reads aligned, (2) the percentage of reads aligned with zero mismatches (ZeroMismatchPercentage), and (3) the percentage of reads aligned with at most one mismatch (ZeroOneMismatchPercentage). We then evaluate the impact of alignment performance on gene expression estimation using three metrics: (1) gene detection accuracy, (2) the number of genes falsely quantified (FalseExpNum), and (3) the number of genes with falsely estimated fold changes (FalseFcNum). We found that FalseExpNum and FalseFcNum are correlated across pipelines. Moreover, FalseExpNum is linearly correlated with the percentage of reads aligned and with ZeroMismatchPercentage, and FalseFcNum is linearly correlated with ZeroMismatchPercentage. Because of these correlations, the percentage of reads aligned and ZeroMismatchPercentage can be used to assess the quality of gene expression estimation for any RNA-seq dataset.
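The three alignment metrics can be computed from aligner output along the following lines, here sketched with pysam over a BAM file. This assumes the aligner records the standard NM tag (edit distance, which counts indels as well as mismatches, so it is only a proxy for a pure mismatch count); the file path and the choice of denominator (all primary reads) are assumptions, not the paper's exact definitions.

```python
import pysam

aligned = zero_mm = zero_one_mm = total = 0
with pysam.AlignmentFile("sample.bam", "rb") as bam:
    for read in bam.fetch(until_eof=True):
        if read.is_secondary or read.is_supplementary:
            continue                      # count each read once
        total += 1
        if read.is_unmapped:
            continue
        aligned += 1
        nm = read.get_tag("NM") if read.has_tag("NM") else None
        if nm == 0:
            zero_mm += 1
        if nm is not None and nm <= 1:
            zero_one_mm += 1

print("PercentAligned:", 100 * aligned / total)
print("ZeroMismatchPercentage:", 100 * zero_mm / total)
print("ZeroOneMismatchPercentage:", 100 * zero_one_mm / total)
```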

