Facial expression recognition based on LLENet
Dan Meng, Guitao Cao, Zhihai He, W. Cao
Pub Date: 2016-12-01 | DOI: 10.1109/BIBM.2016.7822814
Facial expression recognition plays an important role in lie detection and computer-aided diagnosis. Many deep learning facial expression feature extraction methods achieve greater recognition accuracy and robustness than traditional feature extraction methods. However, most current deep learning methods require special parameter tuning and ad hoc fine-tuning tricks. This paper proposes a novel feature extraction model called Locally Linear Embedding Network (LLENet) for facial expression recognition. The proposed LLENet first reconstructs image sets for the cropped images. Unlike previous deep convolutional neural networks that initialize convolutional kernels randomly, we learn multi-stage kernels from the reconstructed image sets directly in a supervised way. We also create an improved LLE to select kernels, from which we can obtain the most representative feature maps. Furthermore, to better measure the contribution of these kernels, a new distance based on the kernel Euclidean distance is proposed. After multi-scale feature analysis, the feature representations are finally fed into a linear classifier. Experimental results on the CK+ facial expression dataset show that the proposed model captures the most representative features and thus improves on previous results.
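At the heart of LLE is a reconstruction step: each point is expressed as a convex combination of its nearest neighbors, and the reconstruction error measures how well the point lies on the local manifold. The following is a generic illustration of that step with two neighbors per point, solved in closed form; it is not the authors' kernel-selection code, and the points are invented.

```python
# Minimal sketch of the LLE reconstruction step: find the weight w that best
# reconstructs a 2-D point x from two neighbors a and b with weights (w, 1-w).

def lle_weight(x, a, b):
    """Weight w minimizing ||x - (w*a + (1-w)*b)||^2 (closed form)."""
    ab = (a[0] - b[0], a[1] - b[1])
    xb = (x[0] - b[0], x[1] - b[1])
    denom = ab[0] ** 2 + ab[1] ** 2
    if denom == 0:          # degenerate neighbors: split the weight evenly
        return 0.5
    return (xb[0] * ab[0] + xb[1] * ab[1]) / denom

def reconstruction_error(x, a, b):
    """Distance between x and its best linear reconstruction from a and b."""
    w = lle_weight(x, a, b)
    rx = (w * a[0] + (1 - w) * b[0], w * a[1] + (1 - w) * b[1])
    return ((x[0] - rx[0]) ** 2 + (x[1] - rx[1]) ** 2) ** 0.5

# A point collinear with its neighbors is reconstructed exactly (error ~0);
# a point off that line has a strictly positive error.
print(reconstruction_error((1.0, 1.0), (0.0, 0.0), (2.0, 2.0)))
print(reconstruction_error((1.0, 2.0), (0.0, 0.0), (2.0, 2.0)))
```

In the paper, reconstruction quality of this kind is what lets the improved LLE rank candidate kernels by how representative their feature maps are.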
A modified rough-fuzzy clustering algorithm with spatial information for HEp-2 cell image segmentation
Shaswati Roy, P. Maji
Pub Date: 2016-12-01 | DOI: 10.1109/BIBM.2016.7822549
Indirect immunofluorescence (IIF) analysis is the most effective test for antinuclear autoantibody (ANA) analysis, used to reveal the occurrence of autoimmune diseases such as connective tissue disorders. In ANA testing, human epithelial type 2 (HEp-2) cells are most commonly used as the substrate. However, recognizing the staining pattern of ANAs in an IIF image requires proper detection of the region of interest. Automatic segmentation of IIF images is therefore an essential prerequisite, as manual segmentation is labor intensive, time consuming, and subjective. Recently, rough-fuzzy clustering has been shown to provide significant results for image segmentation by handling the different uncertainties present in images. However, the existing robust rough-fuzzy clustering algorithm does not consider the spatial distribution of the image, which is useful when the image is distorted by noise and other artifacts. This paper therefore proposes a segmentation algorithm that incorporates a spatial constraint into robust rough-fuzzy clustering. In the proposed method, the class label of a pixel is influenced by its neighboring pixels according to their spatial distance, so that more neighboring pixels can be incorporated into the computation of a pixel's label. The performance of the proposed method is evaluated on several HEp-2 cell images and compared with existing algorithms through both qualitative and quantitative results.
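The spatial idea can be pictured with plain fuzzy c-means memberships: averaging a pixel's membership vector with those of its neighbors pulls isolated noisy pixels toward the label of their surroundings. The sketch below uses a 1-D "image", two fixed centroids, and standard FCM memberships; it illustrates the spatial constraint only and is not the authors' rough-fuzzy algorithm.

```python
# Illustrative only: spatially smoothed fuzzy memberships on a 1-D signal.

def fcm_memberships(pixels, centroids, m=2.0):
    """Standard fuzzy c-means membership of each pixel in each cluster."""
    out = []
    for x in pixels:
        d = [abs(x - c) + 1e-12 for c in centroids]
        u = [1.0 / sum((d[i] / dj) ** (2.0 / (m - 1.0)) for dj in d)
             for i in range(len(centroids))]
        out.append(u)
    return out

def spatially_smooth(memberships, radius=1):
    """Average each membership vector over a window of spatial neighbors."""
    n = len(memberships)
    smoothed = []
    for k in range(n):
        lo, hi = max(0, k - radius), min(n, k + radius + 1)
        window = memberships[lo:hi]
        smoothed.append([sum(u[i] for u in window) / len(window)
                         for i in range(len(memberships[0]))])
    return smoothed

# A noisy bright pixel (index 3) inside a dark region: after smoothing, its
# membership in the dark cluster rises, following its neighbors.
pixels = [10, 12, 11, 200, 13, 12]
raw = fcm_memberships(pixels, centroids=[10, 200])
sm = spatially_smooth(raw)
print(raw[3][0], sm[3][0])
```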
Visual orchestration and autonomous execution of distributed and heterogeneous computational biology pipelines
Xin Mou, H. Jamil, R. Rinker
Pub Date: 2016-12-01 | DOI: 10.1109/BIBM.2016.7822615
Data integration continues to baffle researchers even though substantial progress has been made. Although the emergence of technologies such as XML, web services, the semantic web, and cloud computing has helped, a system in which biologists are comfortable articulating new applications and developing them without technical assistance from a computing expert has yet to be realized. The distance between a friendly graphical interface that does little and a "traditional" system that is clunky yet powerful is more often than not deemed too great. The question that remains unanswered is whether a user can state a query involving a set of complex, heterogeneous, and distributed life sciences resources in an easy-to-use language and execute it without further help from a computer-savvy programmer. In this paper, we present a declarative meta-language, called VisFlow, for requirement specification, and a translator for mapping requirements into executable queries in a variant of SQL augmented with integration artifacts.
Combinational logic network for digitally coded gene expression of gastric cancer
Sungjin Park, S. Nam
Pub Date: 2016-12-01 | DOI: 10.1109/BIBM.2016.7822788
Boolean networks have generally been applied to time-series datasets. In next-generation sequencing-based cancer genomics, however, cross-sectional datasets covering enormous numbers of patients have accumulated. Here, we address the representation of cross-sectional datasets using Boolean networks, specifically a combinational logic network approach. We then applied the approach to a real cancer patient dataset, demonstrating the feasibility of using Boolean networks for the graphical representation of cross-sectional datasets.
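A combinational logic representation starts by digitizing each gene's continuous expression against a threshold and then describing patients by logic over the resulting bits. The genes, threshold, and rule below are invented for illustration; they are not taken from the paper.

```python
# Sketch: digitally coding expression values and evaluating a combinational
# logic rule over the bits (all names and values are hypothetical).

def digitize(expression, threshold):
    """Map each gene's continuous expression to a 0/1 code."""
    return {g: int(v >= threshold) for g, v in expression.items()}

def risk_rule(bits):
    """Example combinational rule: (A AND B) OR (NOT C)."""
    return (bits["A"] & bits["B"]) | (1 - bits["C"])

patient = {"A": 7.2, "B": 5.9, "C": 8.4}
bits = digitize(patient, threshold=6.0)
print(bits, risk_rule(bits))
```

Because the rule is pure combinational logic (no state), it applies directly to cross-sectional data, where each patient is a single snapshot rather than a time series.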
Some comparisons of gene expression classifiers
Shinuk Kim, M. Kon, Hyowon Lee
Pub Date: 2016-12-01 | DOI: 10.1109/BIBM.2016.7822783
Numerous computational studies related to cancer have been published, but increasing prediction accuracy on molecular datasets remains a challenge. Here we present a comparison of predictions based on feature selection combined with machine learning for microRNA-Seq (miRNA-Seq) and mRNA-Seq data. We tested three classifiers (support vector machine, decision tree, and k-nearest neighbors) under two feature selection methods: Fisher feature selection and infinite feature selection.
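The Fisher criterion scores each feature by the squared difference of class means divided by the sum of class variances; the top-scoring features are then passed to the downstream classifier. A minimal sketch on toy two-class data (not the paper's miRNA-Seq/mRNA-Seq pipeline):

```python
# Fisher score of a single feature for a binary classification problem.

def fisher_score(values, labels):
    a = [v for v, y in zip(values, labels) if y == 0]
    b = [v for v, y in zip(values, labels) if y == 1]
    mean = lambda xs: sum(xs) / len(xs)
    var = lambda xs: sum((x - mean(xs)) ** 2 for x in xs) / len(xs)
    # Small epsilon guards against zero variance in both classes.
    return (mean(a) - mean(b)) ** 2 / (var(a) + var(b) + 1e-12)

labels = [0, 0, 0, 1, 1, 1]
informative = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]   # cleanly separates the classes
noisy       = [1.0, 5.0, 3.0, 1.1, 5.1, 2.9]   # does not
print(fisher_score(informative, labels) > fisher_score(noisy, labels))  # True
```

Ranking all features by this score and keeping the top k is the selection step; the surviving features are what the SVM, decision tree, or k-NN classifier actually sees.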
Android application for therapeutic feed and fluid calculation in neonatal care - a way to fast, accurate and safe health-care delivery
A. Biswas, Romil Roy, Sourya Bhattacharyya, Deepak Khaneja, S. D. Bhattacharya, J. Mukhopadhyay
Pub Date: 2016-12-01 | DOI: 10.1109/BIBM.2016.7822650
Delivering medical care to newborn babies in their early days of life involves complex mathematical calculations for feeding, intravenous fluid, and electrolyte requirements. Performing these calculations manually is time consuming and a potential source of medical error. This work proposes a standalone Android application for newborn care units that can run on any handheld Android device, such as a mobile phone, and helps health-care professionals calculate parameters for the feed and intravenous fluid to be given to a newborn baby, including total fluid intake, glucose infusion rate, energy, protein, lipid amount, and electrolytes. Its logic is based on medical guidelines for feed and fluid management of newborn babies. It maintains consistency across a large set of interrelated variables using an existential abstraction approach, excluding wrong proportions of dextrose, protein, lipid, or fluid volume by showing error and warning messages wherever needed, which acts as a safety measure against medication errors. The objective of the work is to make the medical calculation process faster, safer, and more accurate. A prototype of the application is being evaluated in a Sick Newborn Care Unit (SNCU) in Kolkata, India.
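One of the listed parameters, the glucose infusion rate (GIR), is commonly computed as dextrose% x rate(mL/hr) / (6 x weight(kg)). The sketch below shows that calculation plus a range check standing in for the app's warning messages; the target range used here is an assumption (typical neonatal targets vary by protocol), and this is not the application's actual code.

```python
# Hedged sketch of a GIR calculation with a safety-range warning.

def glucose_infusion_rate(dextrose_pct, rate_ml_per_hr, weight_kg):
    """GIR in mg/kg/min from dextrose %, infusion rate, and body weight."""
    if weight_kg <= 0:
        raise ValueError("weight must be positive")
    return dextrose_pct * rate_ml_per_hr / (6.0 * weight_kg)

def gir_warning(gir, low=4.0, high=12.0):
    """Flag values outside an assumed target range, like the app's alerts."""
    if gir < low:
        return "warning: GIR below target"
    if gir > high:
        return "warning: GIR above target"
    return "ok"

gir = glucose_infusion_rate(10, 6, 2.0)   # 10% dextrose, 6 mL/hr, 2 kg baby
print(round(gir, 2), gir_warning(gir))    # 5.0 ok
```

Encoding such formulas with hard validity checks, rather than leaving them to mental arithmetic, is exactly the safety argument the paper makes.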
Factorial analysis of error correction performance using simulated next-generation sequencing data
Isaac Akogwu, Nan Wang, Chaoyang Zhang, Hwanseok Choi, H. Hong, P. Gong
Pub Date: 2016-12-01 | DOI: 10.1109/BIBM.2016.7822685
Error correction is a critical initial step in next-generation sequencing (NGS) data analysis. Although more than 60 tools have been developed, there is no systematic, evidence-based comparison of their strengths and weaknesses, especially in terms of correction accuracy. Here we report a full factorial simulation study examining how NGS dataset characteristics (in particular genome size, coverage depth, and read length) affect error correction performance (precision and F-score), and comparing the performance sensitivity/resistance of six k-mer spectrum-based methods to variations in these characteristics. Multi-way ANOVA tests indicate that both the choice of correction method and the dataset characteristics had significant effects on the performance metrics. Overall, BFC, Bless, Bloocoo, and Musket performed better than Lighter and Trowel on 27 synthetic datasets. For each method, read length and coverage depth had a more pronounced impact on performance than genome size. This study sheds light on the behavior of error correction methods in response to the variables commonly encountered in real-world NGS datasets. It also warrants further studies with wet lab-generated experimental NGS data to validate the findings of this simulation study.
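The 27 synthetic datasets are consistent with a full 3x3x3 factorial over the three characteristics: every combination of three levels of genome size, coverage depth, and read length. A sketch of generating such a design (the level values below are illustrative placeholders, not the paper's settings):

```python
# Full factorial design: one run per combination of factor levels.
from itertools import product

factors = {
    "genome_size_mb": [5, 50, 500],      # hypothetical levels
    "coverage_depth": [10, 30, 70],
    "read_length_bp": [75, 100, 150],
}

design = [dict(zip(factors, levels)) for levels in product(*factors.values())]
print(len(design))    # 3 * 3 * 3 = 27 runs
print(design[0])
```

Running every correction tool on every row of such a design is what makes the multi-way ANOVA valid: each factor's effect can be estimated free of confounding with the others.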
A method of removing Ocular Artifacts from EEG using Discrete Wavelet Transform and Kalman Filtering
Yan Chen, Qinglin Zhao, Bin Hu, Jianpeng Li, Hua Jiang, Wenhua Lin, Yang Li, Shuangshuang Zhou, Hong Peng
Pub Date: 2016-12-01 | DOI: 10.1109/BIBM.2016.7822742
Electroencephalography (EEG) is a noninvasive method of recording the electrical activity of the brain, used extensively in brain function research due to its high temporal resolution. However, raw EEG is a mixture of signals that contains noise, such as Ocular Artifacts (OAs), irrelevant to the cognitive function of the brain. Many methods have been proposed to remove OAs from EEG, such as Independent Component Analysis (ICA), Discrete Wavelet Transform (DWT), Adaptive Noise Cancellation (ANC), and Wavelet Packet Transform (WPT). In this paper, we present a novel hybrid de-noising method that uses DWT and Kalman filtering to remove OAs from EEG. We first applied the method to simulated data: the Mean Squared Error (MSE) of the DWT-Kalman method was 0.0017, significantly lower than the 0.0468 and 0.0052 obtained with WPT-ICA and DWT-ANC, respectively. The Mean Absolute Error (MAE) of DWT-Kalman averaged 0.0052, again better than WPT-ICA (0.0218) and DWT-ANC (0.0115). We then applied the approach to raw data collected with our prototype three-channel EEG collector and with a 64-channel Braincap from Brain Products, achieving satisfying results on both. Since the method does not rely on any particular electrode or on the number of electrodes in a given system, it is well suited to ubiquitous applications.
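The Kalman half of the hybrid can be illustrated with the scalar filter: each step fuses a prediction with a noisy measurement, weighted by the Kalman gain. The random-walk model and the noise parameters q and r below are arbitrary choices for illustration; the paper's filter operates on DWT coefficients, not raw samples.

```python
# Minimal scalar Kalman filter (random-walk state model).

def kalman_1d(measurements, q=1e-4, r=0.5, x0=0.0, p0=1.0):
    x, p = x0, p0
    estimates = []
    for z in measurements:
        p = p + q                  # predict: variance grows by process noise q
        k = p / (p + r)            # Kalman gain: trust in the new measurement
        x = x + k * (z - x)        # update state with the measurement residual
        p = (1 - k) * p            # shrink variance after the update
        estimates.append(x)
    return estimates

# Noisy observations of a constant level 1.0: the estimate converges to ~1.0.
zs = [1.3, 0.8, 1.1, 0.9, 1.2, 1.0, 0.95, 1.05]
print(kalman_1d(zs)[-1])
```

In the hybrid scheme, the signal is first decomposed with the DWT so that the artifact-carrying coefficients can be filtered in this recursive fashion before reconstruction.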
Research on early risk predictive model and discriminative feature selection of cancer based on real-world routine physical examination data
Guixia Kang, Zhuang Ni
Pub Date: 2016-12-01 | DOI: 10.1109/BIBM.2016.7822746
Most cancers show no obvious symptoms at early stages, and curative treatment is often no longer an option by the time cancer is diagnosed. Accurate prediction of early cancer risk has therefore become urgently necessary in medicine. In this paper, our purpose is to fully utilize real-world routine physical examination data to identify the most discriminative features of cancer using the ReliefF algorithm and to build early cancer risk predictive models with three machine learning (ML) algorithms. We use physical examination data, with a return visit one month later, from the CiMing Health Checkup Center. The ReliefF algorithm selects the top 30 features by weight, written as Sub(30), from our data collection of 34 features and 2300 candidates. A four-layer (two hidden layers) deep neural network (DNN) trained with the back-propagation algorithm, a support vector machine (SVM) with a linear kernel, and a CART decision tree are used to predict cancer risk under 5-fold cross validation. We use predictive accuracy, AUC-ROC, sensitivity, and specificity to assess the discriminative ability of the three proposed methods. The results show that, compared with the other two methods, the SVM obtains a higher AUC and specificity of 0.926 and 95.27%, respectively, while the best predictive accuracy (86%) is achieved by the DNN. Moreover, a fuzzy threshold interval for the DNN is proposed; with the revised threshold interval, the sensitivity, specificity, and accuracy of the DNN are 90.20%, 94.22%, and 93.22%, respectively. This research indicates that applying ML methods together with risk feature selection to real-world routine physical examination data is meaningful and promising for cancer prediction.
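The weighting idea behind ReliefF can be seen in the simpler binary Relief algorithm: a feature gains weight when it separates a sample from its nearest miss (nearest neighbor of the other class) and loses weight when it differs from the nearest hit (same class). The toy data below is invented; the paper applies ReliefF, which generalizes this to k neighbors and multiple classes, to 34 physical-examination features.

```python
# Simplified binary Relief weighting over a tiny two-feature dataset.

def relief_weights(X, y):
    n_feat = len(X[0])
    w = [0.0] * n_feat
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    for i, (xi, yi) in enumerate(zip(X, y)):
        others = [(dist(xi, xj), xj, yj)
                  for j, (xj, yj) in enumerate(zip(X, y)) if j != i]
        hit = min((o for o in others if o[2] == yi), key=lambda o: o[0])[1]
        miss = min((o for o in others if o[2] != yi), key=lambda o: o[0])[1]
        for f in range(n_feat):
            # reward separation from the miss, penalize spread within the class
            w[f] += abs(xi[f] - miss[f]) - abs(xi[f] - hit[f])
    return w

# Feature 0 tracks the class label; feature 1 is noise and should rank lower.
X = [[0.0, 0.3], [0.1, 0.9], [1.0, 0.5], [0.9, 0.1]]
y = [0, 0, 1, 1]
print(relief_weights(X, y))
```

Sorting features by these weights and keeping the top 30 is the Sub(30) selection step that precedes the DNN, SVM, and CART models.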
Mining sequential patterns from uncertain big DNA in the spark framework
Fan Jiang, C. Leung, O. Sarumi, Christine Y. Zhang
Pub Date: 2016-12-01 | DOI: 10.1109/BIBM.2016.7822641
Big data has become ubiquitous, as high volumes of a wide variety of valuable data of different veracities (e.g., precise, imprecise, or uncertain data) are made available at high velocity through high-throughput machines and techniques for data gathering and curation in many real-life applications, across domains such as bioinformatics, biomedicine, finance, social networking, and weather forecasting. In bioinformatics, terabytes of deoxyribonucleic acid (DNA) sequences can now be generated within a few hours using next-generation sequencing (NGS) technologies such as the Illumina HiSeq X and the Illumina Genome Analyzer. Due to the nature of these NGS technologies, the generated data usually carry inherent noise or other forms of error. These uncertain data embed a wealth of information in the form of frequent patterns, and mining frequently occurring patterns (e.g., motifs) from big uncertain DNA sequences is a challenge in bioinformatics and biomedicine. Many existing algorithms are serial and mine DNA sequence motifs using precise data mining methods, yet motif mining over big DNA sequences is computationally intensive because of the volume and associated uncertainty of the sequences. In this paper, we propose a scalable algorithm for high-performance computing in bioinformatics. Specifically, our parallel algorithm uses a fault-tolerant collection of resilient distributed datasets (RDDs) in the Apache Spark computing framework to mine sequence motifs from uncertain big DNA data. Experimental results show that our algorithm extracts accurate motifs within a short time frame.
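The mining step can be pictured as a map-reduce over reads: emit every k-mer, aggregate the counts, and keep those above a support threshold. On Spark this would be a flatMap followed by reduceByKey over an RDD of reads; the dependency-free sketch below mirrors that dataflow in plain Python (it omits the paper's handling of uncertainty).

```python
# Frequent k-mer (motif) mining, structured like a flatMap / reduceByKey job.
from collections import Counter

def kmers(read, k):
    """All length-k substrings of a read (the flatMap stage)."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def frequent_motifs(reads, k, min_support):
    counts = Counter()                 # stands in for reduceByKey aggregation
    for read in reads:
        counts.update(kmers(read, k))
    return {m: c for m, c in counts.items() if c >= min_support}

reads = ["ACGTACGT", "TACGTTAC", "ACGTTACG"]
print(frequent_motifs(reads, k=4, min_support=3))  # {'ACGT': 4, 'TACG': 3}
```

In the distributed version, the reads live in an RDD partitioned across workers, so both the k-mer emission and the count aggregation run in parallel, with Spark's lineage providing the fault tolerance the paper relies on.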