Pub Date: 2022-10-19 | DOI: 10.1109/KSE56063.2022.9953767
Thu Kim Le, L. Vinh
The evolutionary process of characters (e.g., nucleotides or amino acids) is heterogeneous among the sites of an alignment. Applying the same evolutionary model to all sites leads to unreliable results in evolutionary studies. Partitioning alignments into sub-alignments (groups) such that the sites in each sub-alignment follow the same model of evolution is a proper and promising approach to handling this heterogeneity among sites. A number of computational methods have been proposed to partition alignments; however, they are unable to properly handle invariant sites. The iterative k-means algorithm is widely used to partition large alignments but was recently suspended because it always groups all invariant sites into one group, which can distort the phylogenetic trees reconstructed from the sub-alignments. In this paper, we improve the iterative k-means algorithm for protein alignments by combining amino acids with their secondary structures to properly partition invariant sites. The secondary structure information helps distribute invariant sites into different groups, each including both variant and invariant sites. Experiments on real large protein alignments showed that the new algorithm overcomes the pitfall of grouping all invariant sites into one group and consequently produces better partitioning schemes.
Title: "A protein secondary structure-based algorithm for partitioning large protein alignments"
Published in: 2022 14th International Conference on Knowledge and Systems Engineering (KSE)
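The core idea — letting a secondary-structure label keep invariant sites from collapsing into one cluster — can be illustrated with a toy sketch. This is an illustrative simplification, not the authors' iterative k-means implementation; the rate bucketing and structure labels are assumptions:

```python
# Toy sketch: pair each site's evolutionary rate with its secondary-structure
# class so that invariant sites (rate 0.0) split across structure classes
# instead of collapsing into a single group.
def partition_sites(site_rates, ss_labels, k=2):
    """Group alignment sites by (coarse rate bucket, secondary structure).

    site_rates -- per-site evolutionary rate (0.0 for invariant sites)
    ss_labels  -- per-site secondary-structure class: 'H', 'E', or 'C'
    Returns a dict mapping (bucket, ss_class) -> list of site indices.
    """
    groups = {}
    for i, (rate, ss) in enumerate(zip(site_rates, ss_labels)):
        bucket = min(int(rate * k), k - 1)  # coarse discretisation of the rate
        groups.setdefault((bucket, ss), []).append(i)
    return groups
```

With this grouping, two invariant sites in a helix and a strand land in different groups, (0, 'H') and (0, 'E'), because the structure label also enters the key.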
Pub Date: 2022-10-19 | DOI: 10.1109/KSE56063.2022.9953787
N. M. Tuan, Huynh Thi Khanh Chi, N. Hop
The objective of this paper is to propose a hybrid machine learning approach that uses a Cost-Complexity Pruning Decision Trees algorithm to predict supply chain risks, in particular delayed deliveries. Recursive Feature Elimination with Cross-Validation is used to improve the feature selection of the Decision Tree classifier. A Two-Phase Cost-Complexity Pruning technique is then developed to reduce the overfitting of the tree-based algorithms. A case study of an e-commerce enabler in Vietnam illustrates the efficiency of the proposed models. The obtained results show promise in terms of predictive performance.
Title: "A Hybrid Machine Learning Approach in Predicting E-Commerce Supply Chain Risks"
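The criterion behind cost-complexity pruning penalises tree size: R_alpha(T) = R(T) + alpha * |leaves(T)|, so a larger alpha trades training error for a smaller tree. A minimal sketch of subtree selection under this criterion (illustrative only; the paper's two-phase variant is not reproduced here):

```python
# Cost-complexity criterion: R_alpha(T) = R(T) + alpha * |leaves(T)|.
# Larger alpha penalises bigger trees, favouring simpler, less overfit models.
def select_subtree(candidates, alpha):
    """Pick the pruned subtree minimising the penalised cost.

    candidates -- list of (misclassification_rate, n_leaves), one per subtree
    Returns the index of the best candidate for the given alpha.
    """
    costs = [err + alpha * leaves for err, leaves in candidates]
    return min(range(len(costs)), key=costs.__getitem__)
```

At alpha = 0 the largest, most accurate tree wins; as alpha grows, progressively smaller subtrees are preferred.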
Pub Date: 2022-10-19 | DOI: 10.1109/KSE56063.2022.9953798
L. Bui, V. Vu, Bich Van Pham, V. Phan
This paper proposes a cooperative coevolutionary approach, named COESDP, to the software defect prediction (SDP) problem. The proposed method consists of three main phases. The first conducts data preprocessing, including data sampling and cleaning. The second utilizes a multi-population coevolutionary approach (MPCA) to find optimal instance-selection solutions. These first two phases address the imbalanced-data challenge of the SDP problem: the data sampling method helps create a more balanced data set, while MPCA eliminates unnecessary data samples (instances) and selects crucial ones. The output of phase 2 is a set of different optimal solutions, each a way of selecting instances from which to train a classifier (a weak learner). Phase 3 uses an ensemble learning method to combine these weak learners and produce the final result. The proposed algorithm is compared with conventional machine learning algorithms, ensemble learning algorithms, computational intelligence algorithms, and another multi-population algorithm on six standard SDP datasets. Experimental results show that the proposed method gives better and more stable results than the other methods and can tackle the imbalance in SDP data.
Title: "A multi-population coevolutionary approach for Software defect prediction with imbalanced data."
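Phase 3's combination of weak learners can be sketched as a plain majority vote. This is an assumption for illustration; the abstract does not specify the ensemble rule used:

```python
# Assumed combination rule: majority vote over classifiers trained on
# different instance-selection solutions.
def ensemble_predict(weak_learners, x):
    """Return the majority-vote label over all weak learners for sample x.

    weak_learners -- callables mapping a sample to a class label
    """
    votes = [clf(x) for clf in weak_learners]
    return max(set(votes), key=votes.count)
```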
Pub Date: 2022-10-19 | DOI: 10.1109/KSE56063.2022.9953768
Anh Phan Viet, Dung Le Duy, Van Anh Tran Thi, Hung Pham Duy, Truong Vu Van, L. Bui
This paper introduces a solution to assist visually impaired or blind (VIB) people in independently accessing printed and electronic documents. The highlights of the solution are its cost-effectiveness and accuracy. Extracting text and reading it out to users are performed by a pure smartphone application. To be usable by VIB people, advanced image and speech processing technologies are leveraged to enhance the user experience and the accuracy of converting images to text. To build accurate optical character recognition (OCR) models for low-quality images, we combine different solutions, including 1) generating a large and balanced dataset with various backgrounds, 2) correcting distortion and orientation, and 3) applying a sequence-to-sequence model with transformers as the encoder. For ease of use, the text-to-speech (TTS) model generates voice instructions at every interaction, and the interface is designed and adjusted according to user feedback. A test on a scanned document set showed the high accuracy of the OCR model, 98.6% at the character level, and the fluency of the TTS model. As indicated in a trial with VIB people, our application can help them read printed documents conveniently, and, given the popularity of smartphones, it is an affordable solution.
Title: "Towards An Accurate and Effective Printed Document Reader for Visually Impaired People"
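A character-level accuracy such as the reported 98.6% is commonly computed as 1 minus the normalised edit distance; the paper's exact metric is an assumption here. A self-contained sketch:

```python
# One common way to compute character-level OCR accuracy:
# 1 - (Levenshtein distance / reference length).
def char_accuracy(pred, truth):
    """Character accuracy from the edit distance (rolling-array DP)."""
    m, n = len(pred), len(truth)
    dp = list(range(n + 1))  # distances against the empty prefix of pred
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,      # delete from pred
                        dp[j - 1] + 1,  # insert into pred
                        prev + (pred[i - 1] != truth[j - 1]))  # substitute
            prev = cur
    return 1.0 - dp[n] / max(n, 1)
```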
Pub Date: 2022-10-19 | DOI: 10.1109/KSE56063.2022.9953788
A. Tran, T. Luong, Xuan Sang Pham, Thi-Luong Tran
The complexity of today’s web applications entails many security risks, mainly targeted attacks on zero-day vulnerabilities. New attack types often evade the detection capabilities of intrusion detection systems (IDS) and web application firewalls (WAFs) based on traditional pattern-matching rules. Therefore, the need for a new generation of WAF systems using machine learning and deep learning technologies is urgent. Deep learning models require an enormous amount of input data to train accurately, which makes collecting and labeling data very resource-intensive. In addition, web request data is often sensitive or private and should not be disclosed, imposing a challenge for developing high-accuracy deep learning and machine learning models. This paper proposes a privacy-preserving distributed training process for a deep learning web attack detection model. The proposed model allows the participants to share the training process to improve the accuracy of the deep model while preserving the privacy of the local data and local model parameters. It uses the technique of adding noise to the shared parameters to ensure differential privacy: the participants train the local detection model and share intermediate training parameters with some noise, which increases the privacy of the training process. The results evaluated on the CSIC 2010 benchmark dataset show a detection accuracy of more than 98%, close to that of the model without privacy guarantees and much higher than the best accuracy of any non-data-sharing local model.
Title: "Deep Models with Differential Privacy for Distributed Web Attack Detection"
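The noise-addition step can be sketched with the standard Gaussian mechanism: clip the shared update to bound its sensitivity, then add zero-mean Gaussian noise before sharing. The `clip_norm` and `sigma` values are illustrative, not the paper's settings:

```python
import random

# Standard Gaussian-mechanism recipe for differential privacy: clip the
# shared parameter update to bound its L2 sensitivity, then add noise.
def privatize(params, clip_norm=1.0, sigma=0.5, rng=None):
    """Return a noised copy of a parameter update before it is shared."""
    rng = rng or random.Random(0)
    norm = sum(p * p for p in params) ** 0.5
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    clipped = [p * scale for p in params]  # bound the L2 sensitivity
    return [p + rng.gauss(0.0, sigma * clip_norm) for p in clipped]
```

Each participant would apply this to its intermediate parameters before sharing, so the aggregator never sees the exact local update.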
Pub Date: 2022-10-19 | DOI: 10.1109/KSE56063.2022.9953762
Le Duc Thuan, Pham Van Huong, H. Hiep, Nguyen Kim Khanh
This study proposes a new approach to feature selection for Android malware detection based on popularity and contrast in a multi-target approach. The popularity of a feature is based on its frequency in the sample set. The contrast of a feature is of two types: contrast between malware and benign samples, and contrast among malware classes. Clearly, the greater a feature's contrast between classes, the higher its discriminative power. There is a trade-off between the popularity and the contrast of features: as popularity increases, contrast may decrease, and vice versa. Therefore, to evaluate the global value of each feature, we use a global evaluation function (global measurement) following the Pareto multi-objective approach. To evaluate the feature selection method, the selected features are fed into a convolutional neural network (CNN) model, which is tested on a popular Android malware dataset, the AMD dataset. When we removed 1,000 features (500 permission features and 500 API features), accuracy decreased by 0.42% and recall increased by 0.08%.
Title: "Feature selection based on popularity and value contrast for Android malware classification"
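Selecting features that are Pareto-optimal in the (popularity, contrast) trade-off can be sketched as follows; this is a minimal illustration of Pareto dominance, not the paper's global evaluation function:

```python
# Keep the features that are Pareto-optimal in (popularity, contrast):
# a feature survives unless another feature is at least as good on both
# objectives and strictly better on at least one.
def pareto_front(features):
    """features: dict name -> (popularity, contrast); higher is better."""
    front = []
    for a, (pa, ca) in features.items():
        dominated = any(pb >= pa and cb >= ca and (pb > pa or cb > ca)
                        for b, (pb, cb) in features.items() if b != a)
        if not dominated:
            front.append(a)
    return sorted(front)
```

A feature strong on only one objective still survives, which captures the trade-off: high popularity can compensate for low contrast and vice versa.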
Pub Date: 2022-10-19 | DOI: 10.1109/KSE56063.2022.9953749
Duong Nguyen, Nam Cao, Son Nguyen, S. ta, C. Dinh
There has been an increasing demand for good semantic representations of text in the financial sector when solving natural language processing tasks in Fintech. Previous work has shown that widely used modern language models trained in the general domain often perform poorly in this particular domain. There have been attempts to overcome this limitation by introducing domain-specific language models learned from financial text. However, these approaches suffer from the lack of in-domain data, which is further exacerbated for languages other than English. These problems motivate us to develop a simple and efficient pipeline to extract large amounts of financial text from large-scale multilingual corpora such as OSCAR and C4. We conduct extensive experiments with various downstream tasks in three different languages to demonstrate the effectiveness of our approach across a wide range of standard benchmarks.
Title: "MFinBERT: Multilingual Pretrained Language Model For Financial Domain"
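A pipeline that extracts financial text from a general corpus such as OSCAR or C4 might start with a simple keyword filter. This is a toy sketch; the term list and threshold are assumptions, not the paper's pipeline:

```python
# Toy domain filter: keep a document if it mentions enough finance terms.
# The term list and threshold are illustrative assumptions.
FINANCE_TERMS = ("stock", "bond", "dividend", "interest rate", "portfolio")

def is_financial(text, min_hits=2):
    """Return True if the document mentions at least min_hits finance terms."""
    low = text.lower()
    return sum(term in low for term in FINANCE_TERMS) >= min_hits
```

A real pipeline would add deduplication, language identification, and quality filtering on top of such a first-pass domain filter.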
Pub Date: 2022-10-19 | DOI: 10.1109/KSE56063.2022.9953792
Xuan-Hanh Vu, Xuan Dau Hoang, Thi Hong Hai Chu
Recently, DGA (Domain Generation Algorithm) techniques have become popular among malware in general and botnets in particular. DGA allows hacking groups to automatically generate and register domain names for the C&C servers of their botnets, avoiding the blacklisting and takedown that static domain names and IP addresses are subject to. Many sophisticated DGA techniques have been developed and used in practice, including character-based, word-based, and mixed DGA. These techniques can generate anything from simple domain names made of random character combinations to complex domain names built from meaningful words, which are very similar to legitimate domain names. This makes it difficult to monitor and detect botnets in general and DGA botnets in particular. Some solutions efficiently detect character-based DGA domain names but cannot detect word-based and mixed DGA domain names. In contrast, some recent proposals effectively detect word-based DGA domain names but fail on the domain names of some character-based DGA botnets. This paper proposes an ensemble learning model that enables efficient detection of most DGA domain names, both character-based and word-based. The proposed model combines two component models: a character-based DGA botnet detection model and a word-based DGA botnet detection model. The experimental results show that the proposed combined model effectively detects 37 of 39 DGA botnet families with an average detection rate of over 89%.
Title: "A Novel Model Based on Ensemble Learning for Detecting DGA Botnets"
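Combining the two component detectors could be sketched as trusting whichever model is more confident; the abstract does not specify the actual combination rule, so max-score fusion here is an assumed example:

```python
# Assumed combination rule: take the more confident of the two component
# scores.  char_model and word_model each map a domain name to a
# probability in [0, 1] that it is DGA-generated.
def detect_dga(domain, char_model, word_model, threshold=0.5):
    """Return (is_dga, score) for a domain using max-score fusion."""
    score = max(char_model(domain), word_model(domain))
    return score >= threshold, score
```

This way a domain flagged strongly by either the character-based or the word-based model is detected, covering both DGA families.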
Pub Date: 2022-10-19 | DOI: 10.1109/KSE56063.2022.9953620
Trang T. H. Tran, Mai H. Tran, D. T. Nguyen, Tien M. Pham, G. Vu, N. S. Vo, Nam N. Nguyen, Quang T. Vu
Genome-wide association studies (GWAS) with millions of genetic markers have proven useful for precision medicine applications as a means of computing a polygenic risk score (PRS). However, existing PRS models have limited transferability across ancestry groups due to the historical bias of GWAS toward European ancestry. Here we propose an adapted workflow to fine-tune a baseline PRS model on a dataset of the target ancestry. We use the dataset of Vietnamese whole genomes from the 1KVG project to build a PRS model for height prediction in the Vietnamese population. Our best-fit model achieved an increase in R2 of 0.152 (corresponding to 29.8%) compared with the null model, which consists only of the metadata.
Title: "Polygenic risk scores adaptation for Height in a Vietnamese population"
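A polygenic score is a weighted sum of allele dosages, and model fit against the null model is compared via R². A minimal sketch with illustrative effect sizes (not the 1KVG model):

```python
# A PRS is a weighted sum of allele dosages (0, 1, or 2 copies of the
# effect allele); fit against the null model is compared via R^2.
def polygenic_score(dosages, weights):
    """PRS for one individual: sum of effect size x allele dosage."""
    return sum(w * d for w, d in zip(weights, dosages))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

The paper's reported gain corresponds to the difference in R² between the model with the PRS term and the metadata-only null model.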
Pub Date: 2022-10-19 | DOI: 10.1109/KSE56063.2022.9953622
K. Truong, T. Le
The detection and treatment of cancer and other disorders depend on magnetic resonance imaging (MRI) and computed tomography (CT) scans. Compared to CT scans, MRI scans provide sharper pictures, and an MRI is preferable to an X-ray or CT scan when the doctor needs to observe soft tissues. Moreover, MRI scans of organs and soft tissues, such as damaged ligaments and herniated discs, can be more accurate than CT imaging. However, capturing an MRI typically takes longer than a CT scan, and MRI is substantially more expensive because it requires more sophisticated equipment. As a result, it is challenging to gather enough MRI scans for training medical image segmentation models. To address this issue, we suggest using a deep learning network (TarGAN) to reconstruct MRI from CT scans. The generated MRI images can then be used to enrich the training data for MRI image segmentation.
Title: "TarGAN: CT to MRI Translation Using Private Unpaired Data Domain"