Mental fatigue is closely related to our daily life and work, and a considerable number of studies have achieved good results in quantifying and predicting it. Although some studies have achieved high accuracy using only a single channel, and a few have explored optimal solutions for feature and channel selection, detailed research on the optimal electrode positions and the optimal number of channels is rarely seen. In this study, by designing a novel genetic operator and applying a GA-SVM model, we compared the maximum number of optimal channels and their distributions. The results suggest that the classification accuracy nearly reaches its optimum (94.0±5.3%) when the maximum number of channels reaches 5, and that it is not affected by the epoch length. A whole-brain topographic analysis of the optimal channels shows that they are mainly distributed in the prefrontal, occipital and temporal lobes, while hardly any are located in the parietal lobe, indicating that the mental fatigue induced by the visual search task is characterized similarly across individuals and is highly task-related.
{"title":"The Optimal Number and Distribution of Channels in Mental Fatigue Classification Based on GA-SVM","authors":"Yinhe Sheng, Kang Huang, Liping Wang, Pengfei Wei","doi":"10.1145/3309129.3309140","DOIUrl":"https://doi.org/10.1145/3309129.3309140","url":null,"abstract":"Mental fatigue is closely related to our daily life and work, a considerable number of studies have achieved good results in quantifying and predicting them. Although some studies have achieved a high accuracy by using only a single channel, and a few have explored the optimal solution for feature and channel selection. However, detailed research of optimally setting the electrodes position and determining the number channels are rarely seen. In this study, by designing a novel genetic operator and applying the GA-SVM model, we compared the maximum number of optimal channels and their distributions. The result suggests that the classification accuracy almost reaches its optimum (94.0±5.3 %) when the maximum number of channels reaches 5, and is not affected by the epoch length. The whole brain optimal channels topographic map analysis shows that the optimal channels are mainly distributed in the prefrontal, occipital and temporal lobes, while hardly any is located in the parietal lobe, which indicates that the mental fatigue induced by visual search task characterized similarly among different individuals and highly task-related.","PeriodicalId":326530,"journal":{"name":"Proceedings of the 5th International Conference on Bioinformatics Research and Applications","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126221987","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. Chaware, Omkar Udawant, Kiran Joshi, Tejas Deshpande
A battlefield is an area where the attacking behaviour of an opposing force cannot be predicted. The situation may become worse when enemy tanks attack from various positions and there is not enough time to think about defence. If the battle situation can be analyzed in advance, an attack strategy against any assault can be decided more easily. This entire environment can be reproduced in a simulator in which attack and defence decisions are made. In this paper, we propose a battlefield simulator that helps eliminate the manual effort of artillery testing and the demonstration cost it requires. The simulator takes parameters such as the type of artillery to be tested, environmental conditions, and strategic planning. Damage caused by the artillery is calculated using physics formulae designed to approximate actual results. We compared the simulation on CPU and GPU processors and found that the GPU is much faster than the CPU and gives better accuracy.
{"title":"Proposed Battlefield Simulator Using GPU","authors":"S. Chaware, Omkar Udawant, Kiran Joshi, Tejas Deshpande","doi":"10.1145/3309129.3309131","DOIUrl":"https://doi.org/10.1145/3309129.3309131","url":null,"abstract":"Battlefield is an area where you cannot predict the attacking situation from an opposition. The situation may become worse when the enemy tankers may attack from various position and we will not enough get chance to think about our security. If by any mean we can analysis the situation of battling, we can easily decide the attacking strategy against any attack. This entire environment may simulate through a simulator where we can decide to attack and defend ourselves. In this paper, we had proposed a battlefield simulator which helps in eliminating manual efforts of artillery testing and the demonstration cost required for the same. This simulator takes parameters such as type of artillery to be tested, environmental conditions and strategic planning. Damage caused by the artillery is calculated using physics formulae designed for achieving actual results. We had compared the situation with CPU and GPU processor and found that GPU is must faster than CPU and gives more accuracy.","PeriodicalId":326530,"journal":{"name":"Proceedings of the 5th International Conference on Bioinformatics Research and Applications","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127943263","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The development of next-generation sequencing facilitates the study of metagenomics. Computational gene prediction aims to find the locations of genes in a given DNA sequence. Gene prediction in metagenomics is a challenging task because of the short and fragmented nature of the data. Our previous framework, minimum redundancy maximum relevance - support vector machines (mRMR-SVM), produced promising results in metagenomics gene prediction. In this paper, we review available metagenomics gene prediction programs and study the effect of the machine learning approach on gene prediction by altering the underlying machine learning algorithm in our previous framework. Overall, SVM produces the highest accuracy based on tests performed on a simulated dataset.
{"title":"The Effect of Machine Learning Algorithms on Metagenomics Gene Prediction","authors":"Amani A. Al-Ajlan, Achraf El Allali","doi":"10.1145/3309129.3309136","DOIUrl":"https://doi.org/10.1145/3309129.3309136","url":null,"abstract":"The development of next generation sequencing facilitates the study of metagenomics. Computational gene prediction aims to find the location of genes in a given DNA sequence. Gene prediction in metagenomics is a challenging task because of the short and fragmented nature of the data. Our previous framework minimum redundancy maximum relevance - support vector machines (mRMR-SVM) produced promising results in metagenomics gene prediction. In this paper, we review available metagenomics gene prediction programs and study the effect of the machine learning approach on gene prediction by altering the underlining machine learning algorithm in our previous framework. Overall, SVM produces the highest accuracy based on tests performed on a simulated dataset.","PeriodicalId":326530,"journal":{"name":"Proceedings of the 5th International Conference on Bioinformatics Research and Applications","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132307373","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Since the quantity and quality of DNA sequences directly affect the accuracy and efficiency of computation, the design of DNA sequences is critical to DNA computing. To improve the reliability of DNA computing, a rich literature targets making DNA sequences hybridize specifically at a lower melting temperature, with no non-complementary base pairs or mismatched hybridization in the re-formed double helix. However, most existing methods do not control the melting temperature well, because the DNA sequence design problem under the constraints of Hamming distance, secondary structure, and molecular thermodynamics is known to be NP-hard. To achieve a lower and similar melting temperature for each DNA sequence, we propose a DNA sequence coding method based on the Bacterial Foraging Algorithm (BFA). An evaluation criterion is proposed to assess the quality of DNA sequences during the optimization process. With BFA, high-quality DNA strands are replicated to avoid the participation of inferior strands in the operation. Experiments show our proposed approach significantly outperforms existing methods in terms of continuity and melting temperature.
{"title":"DNA Computing Sequence Design Based on Bacterial Foraging Algorithm","authors":"Jiankang Ren, Yao Yao","doi":"10.1145/3309129.3309147","DOIUrl":"https://doi.org/10.1145/3309129.3309147","url":null,"abstract":"Since the quantity and quality of DNA sequence directly affect the accuracy and efficiency of computation, the design of DNA sequence is critical to DNA computing. In order to improve the reliability of DNA computing, there is a rich literature targeting at making DNA sequences specifically hybridize at a lower melting temperature, no non-complementary bases pairs or mismatch hybridization in the reformed double helix. However, most of them are not good enough to control the melting temperature, because DNA sequence design problem under the constraints of hamming distance, secondary structure, molecular thermodynamic is known to be NP-hard. For the sake of achieving the lower and similar melting temperature for each DNA sequence, we proposed a DNA sequence coding method based on Bacterial Foraging Algorithm (BFA). An evaluation criterion is particularly proposed to assess the quality of DNA sequence in the optimization process. With BFA, high-quality DNA strands are replicated to avoid the participation of inferior strands in the operation. Experiments show our proposed approach significantly outperforms existing methods in terms of continuity and melting temperature.","PeriodicalId":326530,"journal":{"name":"Proceedings of the 5th International Conference on Bioinformatics Research and Applications","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124367354","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Anas Khaleel, A. Elbakkoush, Amneh H. Tarkhan, Aiman Mahdi
Colorectal cancer is one of the most common types of cancer in the world, and its incidence is mostly influenced by lifestyle factors. Despite having a much smaller role, genetics also affects the susceptibility and development of colorectal cancer. The aim of the present study is to investigate the regulatory functions of candidate microRNAs (miRs) 1 and 206 in the context of solute carrier family 16 member 3 (SLC16A3) and vascular endothelial growth factor (VEGF) expression. To achieve this, 24 oncogenes targeted by miR-1 and miR-206 were analyzed via GeneMANIA. The miRTarBase database was then employed to ascertain the nature of the miR-oncogene relationship. Our findings illustrate that miR-1/206 indirectly reduce CRC growth and infiltration by targeting both the SLC16A3 and VEGF genes. Moreover, miR-1/206 target the VEGF gene to reduce tumor angiogenesis and vasculature. In conclusion, the results of the current study illustrate a novel regulation pathway in CRC cells, suggesting new potential lines of CRC therapy.
{"title":"Microrna-1/206 Target both Monocarboxylate Transporter(MCT)-4 and Vascular Endothelial Growth Factor(VEGF)Genes Leading to Inhibition of Tumor Growth","authors":"Anas Khaleel, A. Elbakkoush, Amneh H. Tarkhan, Aiman Mahdi","doi":"10.1145/3309129.3309144","DOIUrl":"https://doi.org/10.1145/3309129.3309144","url":null,"abstract":"Colorectal cancer is one of the most common types of cancer in the world, and its incidence is mostly influenced by lifestyle factors. Despite having a much smaller role, genetics also affects the susceptibility and development of colorectal cancer. The aim of the present study is to investigate the regulatory functions of candidate microRNAs (miRs) 1 and 206 in the context of solute carrier family 16 member 3 (SLC16A3) and vascular endothelial growth factor (VEGF) expression. To achieve this, 24 oncogenes targeted by miR-1 and miR-206 were analyzed via GeneMANIA. The miRTarBase database was then employed to ascertain the nature of the miR-oncogene relationship. Our findings illustrate that miR-1/206 indirectly reduce CRC growth and infiltration by targeting the both the SLC16A3 and VEGF genes. Moreover, miR-1/206 targets the VEGF gene to reduce tumor angiogenesis and vasculature. Conclusively, the results of the current study illustrate a novel regulation pathway in CRC cells, suggesting new potential lines of CRC therapy.","PeriodicalId":326530,"journal":{"name":"Proceedings of the 5th International Conference on Bioinformatics Research and Applications","volume":"42 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128220822","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Central sleep apnea (CSA) is a sleep-related disorder in which breathing is either diminished or absent, typically for 10 to 30 seconds, intermittently or in cycles. CSA is usually due to an instability in the body's feedback mechanisms that control respiration, and it can also be an indicator of Arnold-Chiari malformation. Therefore, various attempts have been made to produce a monitoring system for automatic CSA scoring to reduce clinical effort. This paper describes a system that identifies CSA by means of a single-lead ECG and a multilayer perceptron (MLP) network. Results show that a minute-by-minute classification accuracy of over 83% is achievable.
{"title":"Detection of Central Sleep Apnea Based on a Single-Lead ECG","authors":"P. D. Hung","doi":"10.1145/3309129.3309132","DOIUrl":"https://doi.org/10.1145/3309129.3309132","url":null,"abstract":"Central sleep apnea (CSA) is a sleep-related disorder in which breathing is either diminished or absent, typically for 10 to 30 seconds, intermittently or in cycles. CSA is usually due to an instability in the body's feedback mechanisms that control respiration. Central sleep apnea can also be an indicator of Arnold-Chiari malformation. Therefore, various attempts have been made to produce a monitoring system for automatic Central sleep apnea scoring to reduce clinical efforts. This paper describes a system that can identify Central sleep apnea by means of a single-lead ECG and a Multilayer Perceptron network (MLP). Results show that a minute-by-minute classification accuracy of over 83% is achievable.","PeriodicalId":326530,"journal":{"name":"Proceedings of the 5th International Conference on Bioinformatics Research and Applications","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125231471","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. Sugiono, Denny Widhayanuriyawan, Debrina P. Andriyani
Controlling driver stress level is becoming a popular research topic and is a very important factor in reducing the risk of road accidents. The aim of this paper is to analyze the impact of road complexity on driver stress level based on the physiological factor of Heart Rate Variability (HRV). The first step of the research is a literature study on human stress, Heart Rate Variability (HRV), electrocardiography (ECG), and NASA TLX mental workload. Each driver wears an ECG monitor, and every heart rate change is recorded over time in three different road conditions: city road, rural road, and motorway. The sample consists of 26 male drivers with an average age of 21 years and an average driving experience of 4.08 years. Mental stress of the drivers was assessed by the frustration level (F) in the NASA TLX questionnaire (subjective measurement) and by HRV in the time domain using mRR (objective measurement). The statistical test demonstrated no significant difference in driver mental stress level between mRR and F - NASA TLX. The city road produced an average F - NASA TLX = 3.92 and mRR = 612.40 ms, the rural road produced an average F - NASA TLX = 3.46 and mRR = 621.26 ms, and the motorway produced an average F - NASA TLX = 2.50 and mRR = 820.20 ms. In short, the mRR of HRV data can be used to monitor the mental stress level of a driver in real time, and consequently it can be beneficially implemented in a car alert safety system.
{"title":"Mental Stress Evaluation of Car Driver in Different Road Complexity Using Heart Rate Variability (HRV) Analysis","authors":"S. Sugiono, Denny Widhayanuriyawan, Debrina P. Andriyani","doi":"10.1145/3309129.3309145","DOIUrl":"https://doi.org/10.1145/3309129.3309145","url":null,"abstract":"Controlling driver stress level is going popular research and put it very important factor to reduce risk of road accident. The aim of the paper is to analysis the impact of road complexity on driver stress level based on physiological factor of Heart Rate Variability (HRV). The first step of the research is literature study on human stress, Heart Rate Variability (HRV), Electrocardiograph (ECG), and NASA TLX mental work load. The driver will use ECG to monitor and then recorded at every heart rate change at any time from three different road conditions of city road, rural road, and motorways. The collected sampling data are 26 male drivers with the average age of 21 years old and average driving experience of 4.08 years. Mental stress evaluation of driver was assessed by frustration level (F) in NASA TLX questioner (subjective measurenment) and HRV in time domain analysis mRR (objective measurenment). The statistic test demontrated that there are not signifficant different mental stress level for driver between mRR and F - NASA TLX. The city road produced avarage F - NASA TLX = 3.92 and mRR = 612.40ms, rural road produced avarage F - NASA TLX = 3.46 and mRR = 621.26 ms, and motorway produced avarage F - NASA TLX = 2.50 and mRR = 820.20 ms. In sort, the mRR of HRV data can be used to monitor the mental stress level of driver in real time as consequence it baneficely implemented in car alert safety system.","PeriodicalId":326530,"journal":{"name":"Proceedings of the 5th International Conference on Bioinformatics Research and Applications","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129101140","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The publication of biological literature is increasing year by year, and important information in biomedical articles may appear only in tables. However, research on information extraction from tables is rare. Currently, there are two ways to do table mining. The first is to convert the document to HTML format, but the conversion quality is poor. The second is to use documents in XML format directly, but the number of XML documents is limited. To solve this problem, we propose Biotable, a tool for mining biological tables in PDF documents. We use the concept of Connected Value to locate the table boundary and each cell after converting each page of the PDF into an image. In the analysis of the table header fields, we convert all heterogeneous table headers into one row, which gives a better understanding of the semantics of each column. Based on Biotable and the pipeline proposed by QTLMiner, we performed a table mining experiment on QTLMiner's dataset. The precision of table detection is 98.12% and the recall of table detection is 93.14%. The recall of QTL statements is 86.53%.
{"title":"Biotable: A Tool to Extract Semantic Structure of Table in Biology Literature","authors":"Daipeng Luo, Jing Peng, Yuhua Fu","doi":"10.1145/3309129.3309139","DOIUrl":"https://doi.org/10.1145/3309129.3309139","url":null,"abstract":"The publication of biological literature increasing year by year. And the important information in biomedical articles may only appear in tables. However, research on information extraction from tables is rare. Nowadays, there are two ways to do table mining. The first way is that researchers convert the document to HTML format, but the performance of conversion is terrible. The second way is that researchers use documents in XML format directly, but the number of XML documents are limited. To solve this problem, we propose Biotable, a tool for mining biological tables in PDF documents. We use the concept of Connected Value to locate the table boundary and locate each cell after converting each page of the PDF into a picture. In the analysis of the table header field, we convert all the heterogeneous table headers into one row. Then we will have better understanding of the semantics of each column. Based on Biotable and the pipeline QTLMiners proposed, we performed a table mining experiment on QTLMiner's dataset. The precision value of the table detection is 98.12% and the recall value of table detection is 93.14%. The recall value of QTL statements is 86.53%.","PeriodicalId":326530,"journal":{"name":"Proceedings of the 5th International Conference on Bioinformatics Research and Applications","volume":"198 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121716973","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MarkDuplicate is typically one of the most time-consuming operations in the whole genome sequencing pipeline. The Picard tool, which is widely used by biologists to sort reads in genome data and mark duplicate reads in sorted genome data, has relatively low MarkDuplicate performance due to its single-threaded sequential Java implementation, which has a serious impact on current bioinformatics research. To accelerate MarkDuplicate in Picard, we present a two-stage optimization solution as a preliminary study on next-generation bioinformatics software tools. In the first stage, we improve the original algorithm for tracking optical duplicate reads by eliminating large redundant operations, achieving up to a 50X speedup for that step alone and a 9.57X speedup for the overall process. In the second stage, we redesign the I/O processing mechanism of MarkDuplicate to transform between on-disk genome files and in-memory genome data using the ADAM format instead of the previous SAM format, and implement a cloud-scale MarkDuplicate application in Scala. Our evaluation is performed on a Spark cluster with 25 worker nodes and the Hadoop distributed file system. According to the results, our cloud-scale MarkDuplicate provides the same output with better performance compared with the original Picard tool and other existing similar tools. Specifically, among the 13 sets of real whole genome data used for evaluation at both stages, the best improvement reduces runtime by 92 hours in total, and the average runtime reduction is 48.69 hours.
{"title":"A Study on Optimizing MarkDuplicate in Genome Sequencing Pipeline","authors":"Qi Zhao","doi":"10.1145/3309129.3309134","DOIUrl":"https://doi.org/10.1145/3309129.3309134","url":null,"abstract":"MarkDuplicate is typically one of the most time-consuming operations in the whole genome sequencing pipeline. Picard tool, which is widely used by biologists to sort reads in genome data and mark duplicate reads in sorted genome data, has relatively low performance on MarkDuplicate due to its single-thread sequential Java implementation, which has caused serious impact on nowadays bioinformatic researches. To accelerate MarkDuplicate in Picard, we present our two-stage optimization solution as a preliminary study on next generation bioinformatic software tools to better serve bioinformatic researches. In the first stage, we improve the original algorithm of tracking optical duplicate reads by eliminating large redundant operations. As a consequence, we achieve up to 50X speedup for the second step only and 9.57X overall process speedup. At the next stage, we redesign the I/O processing mechanism of MarkDuplicate as transforming between on-disk genome file and in-memory genome data by using ADAM format instead of previous SAM format, and implement cloud-scale MarkDuplicate application by Scala. Our evaluation is performed on top of Spark cluster with 25 worker nodes and Hadoop distributed file system. According to the evaluation results, our cloudscale MarkDuplicate can provide not only the same output but also better performance compared with the original Picard tool and other existing similar tools. Specifically, among the 13 sets of real whole genome data we used for evaluation at both stages, the best improvement we gain is reducing runtime by 92 hours in total. Average improvement reaches 48.69 decreasing hours.","PeriodicalId":326530,"journal":{"name":"Proceedings of the 5th International Conference on Bioinformatics Research and Applications","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116202506","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nowadays, machine learning is applied in various aspects of life, especially in health care. Classification using machine learning has greatly improved in order to make predictions and to support doctors in making diagnoses. Furthermore, human lives are changing with Big Data covering a wide array of scientific knowledge and with data mining solving problems by analyzing data and discovering patterns in existing databases. The prediction process is heavily data-driven, and therefore advanced machine learning techniques are often utilized. In this paper, we look at what types of experimental data are typically used, perform preliminary analysis on them, and generate breast cancer prediction models - all with PySpark and its machine learning frameworks. Using a database with more than a hundred sets of data gathered from routine blood analysis, the accuracy rates of detection and classification are about 72% and 83%, respectively.
{"title":"Breast Cancer Prediction Using Spark MLlib and ML Packages","authors":"P. D. Hung, Tran Duc Hanh, V. Diep","doi":"10.1145/3309129.3309133","DOIUrl":"https://doi.org/10.1145/3309129.3309133","url":null,"abstract":"Nowadays, Machine Learning has been applied in variety aspects of life especially in health care. Classifications using Machine learning has been greatly improved in order to make predictions and to support doctors making diagnoses. Furthermore, human lives are changing with Big Data covering a wide of array of science knowledge and with Data Mining solving problems by analyzing data and discovering patterns in present databases. The prediction process is heavily data driven and therefore advanced machine learning techniques are often utilized. In this paper, we will take a look at what types experiment data are typically used, do preliminary analysis on them, and generate breast cancer prediction models - all with PySpark and its machine learning frameworks. Using a database with more than a hundred sets of data gathered in routine blood analysis, the accuracy rates of detection and classification are about 72% and 83% respectively.","PeriodicalId":326530,"journal":{"name":"Proceedings of the 5th International Conference on Bioinformatics Research and Applications","volume":"81 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133825562","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}