
Latest publications from the 2022 14th International Conference on Knowledge and Systems Engineering (KSE)

A protein secondary structure-based algorithm for partitioning large protein alignments
Pub Date : 2022-10-19 DOI: 10.1109/KSE56063.2022.9953767
Thu Kim Le, L. Vinh
The evolutionary process of characters (e.g., nucleotides or amino acids) is heterogeneous among the sites of alignments. Applying the same evolutionary model to all sites leads to unreliable results in evolutionary studies. Partitioning alignments into sub-alignments (groups) such that the sites in each sub-alignment follow the same model of evolution is a proper and promising approach to adequately handle the heterogeneity among sites. A number of computational methods have been proposed to partition alignments; however, they are unable to properly handle invariant sites. The iterative k-means algorithm is widely used to partition large alignments, but it was recently suspended because it always groups all invariant sites into one group, which might distort the phylogenetic trees reconstructed from the sub-alignments. In this paper, we improve the iterative k-means algorithm for protein alignments by combining amino acids and their secondary structures to properly partition invariant sites. The protein secondary structure information helps classify invariant sites into different groups, each including both variant and invariant sites. Experiments on real large protein alignments showed that the new algorithm overcomes the pitfall of grouping all invariant sites into one group and consequently produces better partitioning schemes.
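The core idea — letting secondary structure separate invariant sites that would otherwise collapse into one cluster — can be sketched with a hypothetical site encoding (illustrative only, not the authors' implementation): each alignment column gets a variability score plus a one-hot secondary-structure label, so two invariant columns in different structural contexts receive different feature vectors.

```python
from collections import Counter

def site_features(column, structure_label, structure_codes=("H", "E", "C")):
    """Encode one alignment column as [variability, one-hot secondary structure].

    Invariant columns all have variability 0.0, but columns located in a helix
    (H), strand (E), or coil (C) still differ in the structure part of the
    vector, so a clustering step need not group them together.
    """
    counts = Counter(column)
    most_common_count = counts.most_common(1)[0][1]
    variability = 1.0 - most_common_count / len(column)  # 0.0 for invariant sites
    one_hot = [1.0 if structure_label == code else 0.0 for code in structure_codes]
    return [variability] + one_hot
```

Feeding vectors like these to k-means keeps invariant sites from forming a single cluster, which is the pitfall the paper addresses.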
Citations: 0
A Hybrid Machine Learning Approach in Predicting E-Commerce Supply Chain Risks
Pub Date : 2022-10-19 DOI: 10.1109/KSE56063.2022.9953787
N. M. Tuan, Huynh Thi Khanh Chi, N. Hop
The objective of this paper is to propose a hybrid machine learning approach using a so-called Cost-Complexity Pruning Decision Trees algorithm to predict supply chain risks, particularly delayed deliveries. The Recursive Feature Elimination with Cross-Validation solution is designed to improve the feature selection function of the Decision Trees classifier. Then, the Two-Phase Cost-Complexity Pruning technique is developed to reduce the overfitting of the tree-based algorithms. A case study of an e-commerce enabler in Vietnam is investigated to illustrate the efficiency of the proposed models. The obtained results show promise in terms of predictive performance.
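Cost-complexity pruning trades training error against tree size: among candidate subtrees, the one minimizing error plus a complexity penalty is kept. A toy sketch of that selection rule (illustrative only, not the paper's two-phase procedure; the candidate list is invented):

```python
# Each candidate subtree is (name, training_error, n_leaves).
# Cost-complexity pruning picks the subtree minimizing error + alpha * n_leaves.
def best_subtree(candidates, alpha):
    """Return the candidate with the lowest cost-complexity score."""
    return min(candidates, key=lambda c: c[1] + alpha * c[2])

subtrees = [("full", 0.02, 40), ("pruned", 0.05, 10), ("stump", 0.20, 2)]
```

A small alpha keeps the large, low-error tree; increasing alpha progressively favors simpler trees, which is how pruning curbs overfitting.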
Citations: 0
A multi-population coevolutionary approach for Software defect prediction with imbalanced data.
Pub Date : 2022-10-19 DOI: 10.1109/KSE56063.2022.9953798
L. Bui, V. Vu, Bich Van Pham, V. Phan
This paper proposes a cooperative coevolutionary approach, named COESDP, to the software defect prediction (SDP) problem. The proposed method consists of three main phases. The first conducts data preprocessing, including data sampling and cleaning. The second phase utilizes a multi-population coevolutionary approach (MPCA) to find optimal instance selection solutions. These first two phases help to deal with the imbalanced-data challenge of the SDP problem. While the data sampling method aids in the creation of a more balanced data set, MPCA supports the elimination of unnecessary data samples (or instances) and the selection of crucial instances. The output of phase 2 is a set of different optimal solutions. Each solution is a way of selecting instances from which to create a classifier (or weak learner). Phase 3 utilizes an ensemble learning method to combine these weak learners and produce the final result. The proposed algorithm is compared with conventional machine learning algorithms, ensemble learning algorithms, computational intelligence algorithms and another multi-population algorithm on 6 standard SDP datasets. Experimental results show that the proposed method gives better and more stable results in comparison with other methods and that it can tackle the challenge of imbalance in the SDP data.
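The combination step in phase 3 can be illustrated with simple majority voting over the weak learners' predictions (a hypothetical simplification — the paper does not specify this exact combination rule):

```python
def majority_vote(predictions):
    """Combine per-learner 0/1 predictions by majority vote.

    predictions: list of lists, one inner list of 0/1 labels per weak learner,
    all over the same test instances. Returns one combined label per instance.
    """
    n_learners = len(predictions)
    # zip(*predictions) iterates per-instance columns of learner votes.
    return [1 if sum(votes) * 2 > n_learners else 0 for votes in zip(*predictions)]
```

Each weak learner here would be trained on a different instance-selection solution produced by the MPCA phase.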
Citations: 0
Towards An Accurate and Effective Printed Document Reader for Visually Impaired People
Pub Date : 2022-10-19 DOI: 10.1109/KSE56063.2022.9953768
Anh Phan Viet, Dung Le Duy, Van Anh Tran Thi, Hung Pham Duy, Truong Vu Van, L. Bui
This paper introduces a solution to assist visually impaired or blind (VIB) people in independently accessing printed and electronic documents. The highlights of the solution are its cost-effectiveness and accuracy. Extracting text and reading it out to users are performed by a pure smartphone application. To be usable by VIB people, advanced technologies in image and speech processing are leveraged to enhance the user experience and the accuracy of converting images to text. To build accurate optical character recognition (OCR) models for low-quality images, we combine different solutions, including 1) generating a large and balanced dataset with various backgrounds, 2) correcting distortion and direction, and 3) applying a sequence-to-sequence model with transformers as the encoder. For ease of use, the text-to-speech (TTS) model generates voice instructions at every interaction, and the interface is designed and adjusted according to user feedback. A test on a scanned document set showed the high accuracy of the OCR model, 98.6% at the character level, and the fluency of the TTS model. As indicated in a trial with VIB people, our application can help them read printed documents conveniently, and it is an affordable solution given the popularity of smartphones.
Citations: 0
Deep Models with Differential Privacy for Distributed Web Attack Detection
Pub Date : 2022-10-19 DOI: 10.1109/KSE56063.2022.9953788
A. Tran, T. Luong, Xuan Sang Pham, Thi-Luong Tran
The complexity of today’s web applications entails many security risks, mainly targeted attacks on zero-day vulnerabilities. New attack types often defeat the detection capabilities of intrusion detection systems (IDS) and web application firewalls (WAFs) based on traditional pattern-matching rules. Therefore, the need for new-generation WAF systems using machine learning and deep learning technologies is urgent today. Deep learning models require an enormous amount of input data to be trained accurately, which makes collecting and labeling data very resource-intensive. In addition, web request data is often sensitive or private and should not be disclosed, posing a challenge for developing high-accuracy deep learning and machine learning models. This paper proposes a privacy-preserving distributed training process for a web attack detection deep learning model. The proposed model allows the participants to share the training process to improve the accuracy of the deep model for web attack detection while preserving the privacy of the local data and local model parameters. It uses the technique of adding noise to the shared parameters to ensure differential privacy. The participants train the local detection model and share intermediate training parameters with some noise, which increases the privacy of the training process. The results evaluated on the CSIC 2010 benchmark dataset show that the detection accuracy is more than 98%, which is close to that of the model without privacy guarantees and much higher than the maximum accuracy of all non-data-sharing local models.
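The standard mechanism behind this kind of differentially private parameter sharing is to clip the shared vector (bounding its sensitivity) and then add Gaussian noise. A minimal sketch under that assumption — the function and parameter names are illustrative, not from the paper:

```python
import random

def privatize(params, clip_norm=1.0, noise_std=0.1, seed=None):
    """Clip a parameter vector to at most clip_norm in L2, then add Gaussian noise.

    Clipping bounds each participant's influence on the shared update;
    the Gaussian noise then provides the differential-privacy guarantee.
    """
    rng = random.Random(seed)
    norm = sum(p * p for p in params) ** 0.5
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    clipped = [p * scale for p in params]
    return [p + rng.gauss(0.0, noise_std) for p in clipped]
```

Each participant would apply such a step to its intermediate training parameters before sharing them; the noise scale controls the privacy/accuracy trade-off the abstract reports.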
Citations: 0
Feature selection based on popularity and value contrast for Android malware classification
Pub Date : 2022-10-19 DOI: 10.1109/KSE56063.2022.9953762
Le Duc Thuan, Pham Van Huong, H. Hiep, Nguyen Kim Khanh
This study proposes a new approach to feature selection for the Android malware detection problem based on popularity and contrast in a multi-target approach. The popularity of a feature is built on the frequency of that feature in the sample set. The contrast of features consists of two types: a contrast between malware and benign samples, and a contrast among malware classes. Obviously, the greater the contrast between classes on a feature, the higher the ability to classify based on that feature. There is a trade-off between the popularity and contrast of features, i.e., as popularity increases, contrast may decrease and vice versa. Therefore, to evaluate the global value of each feature, we use a global evaluation function (global measurement) following the Pareto multi-objective approach. To evaluate the feature selection method, the selected features are fed into a convolutional neural network (CNN) model, and the model is tested on a popular Android malware dataset, the AMD dataset. When we removed 1,000 features (500 permission features and 500 API features), accuracy decreased by only 0.42% and recall increased by 0.08%.
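The two criteria can be sketched for binary features (a hypothetical simplification; the paper's actual global measurement is Pareto-based, and this weighted combination is only a stand-in): popularity is the feature's frequency over all samples, and contrast is the gap between its per-class frequencies.

```python
def popularity(feature_column):
    """Fraction of samples in which this binary feature is present."""
    return sum(feature_column) / len(feature_column)

def contrast(feature_column, labels):
    """Gap between the highest and lowest per-class frequencies of the feature."""
    groups = {}
    for value, label in zip(feature_column, labels):
        groups.setdefault(label, []).append(value)
    freqs = [sum(g) / len(g) for g in groups.values()]
    return max(freqs) - min(freqs)

def global_score(feature_column, labels, w=0.5):
    """Illustrative scalarization of the two objectives (not the paper's measure)."""
    return w * popularity(feature_column) + (1 - w) * contrast(feature_column, labels)
```

A feature present in every sample has popularity 1.0 but contrast 0.0, which is exactly the trade-off the abstract describes.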
Citations: 0
MFinBERT: Multilingual Pretrained Language Model For Financial Domain
Pub Date : 2022-10-19 DOI: 10.1109/KSE56063.2022.9953749
Duong Nguyen, Nam Cao, Son Nguyen, S. ta, C. Dinh
There has been an increasing demand for good semantic representations of text in the financial sector when solving natural language processing tasks in Fintech. Previous work has shown that widely used modern language models trained in the general domain often perform poorly in this particular domain. There have been attempts to overcome this limitation by introducing domain-specific language models learned from financial text. However, these approaches suffer from the lack of in-domain data, which is further exacerbated for languages other than English. These problems motivate us to develop a simple and efficient pipeline to extract large amounts of financial text from large-scale multilingual corpora such as OSCAR and C4. We conduct extensive experiments with various downstream tasks in three different languages to demonstrate the effectiveness of our approach across a wide range of standard benchmarks.
Citations: 0
A Novel Model Based on Ensemble Learning for Detecting DGA Botnets
Pub Date : 2022-10-19 DOI: 10.1109/KSE56063.2022.9953792
Xuan-Hanh Vu, Xuan Dau Hoang, Thi Hong Hai Chu
Recently, DGA has become a popular technique used by many kinds of malware in general and botnets in particular. DGA allows hacking groups to automatically generate and register domain names for the C&C servers of their botnets in order to avoid being blacklisted and disabled, as they would be if using static domain names and IP addresses. Many types of sophisticated DGA techniques have been developed and used in practice, including character-based DGA, word-based DGA and mixed DGA. These techniques can generate anything from simple domain names built of random character combinations to complex domain names built of combinations of meaningful words, which are very similar to legitimate domain names. This makes it difficult for solutions to monitor and detect botnets in general and DGA botnets in particular. Some solutions are able to efficiently detect character-based DGA domain names, but cannot detect word-based and mixed DGA domain names. In contrast, some recent proposals can effectively detect word-based DGA domain names, but cannot effectively detect the domain names of some character-based DGA botnets. This paper proposes a model based on ensemble learning that enables efficient detection of most DGA domain names, covering both character-based DGA and word-based DGA. The proposed model combines two component models: a character-based DGA botnet detection model and a word-based DGA botnet detection model. Experimental results show that the proposed combined model is able to effectively detect 37/39 DGA botnet families with an average detection rate of over 89%.
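The ensemble shape — two specialized detectors whose verdicts are combined — can be illustrated with a deliberately crude sketch. Everything here is hypothetical: the paper's component models are learned classifiers, whereas this toy uses a vowel-ratio heuristic as the character-based signal and an OR rule as the combiner.

```python
def char_score(domain):
    """Toy character-based signal: fraction of non-vowel characters in the name.

    Random-character DGA names like 'xkqzjw' tend to have few vowels, so they
    score high; real words score lower. A real detector would be a trained model.
    """
    name = domain.split(".")[0]
    vowels = sum(ch in "aeiou" for ch in name)
    return 1.0 - vowels / max(len(name), 1)

def is_dga(domain, char_detector, word_detector, threshold=0.8):
    """Flag a domain if either component detector flags it (toy OR-combination)."""
    return char_detector(domain) >= threshold or word_detector(domain)
```

A word-based component (here just a callable returning a boolean) would catch dictionary-word DGA names that the character-based signal misses, which is the complementarity the abstract argues for.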
Citations: 0
Polygenic risk scores adaptation for Height in a Vietnamese population
Pub Date : 2022-10-19 DOI: 10.1109/KSE56063.2022.9953620
Trang T. H. Tran, Mai H. Tran, D. T. Nguyen, Tien M. Pham, G. Vu, N. S. Vo, Nam N. Nguyen, Quang T. Vu
Genome-wide association studies (GWAS) with millions of genetic markers have proven useful for precision medicine applications as a means of computing a polygenic risk score (PRS). However, existing PRS models have limited transferability across ancestry groups, constraining their interpretation and application, due to the historical bias of GWAS toward European ancestry. Here we propose an adapted workflow to fine-tune a baseline PRS model on a dataset of the target ancestry. We use the dataset of Vietnamese whole genomes from the 1KVG project and build a PRS model for height prediction in the Vietnamese population. Our best-fit model achieved an increase in R² of 0.152 (corresponding to 29.8%) compared to the null model, which consists only of the metadata.
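A PRS is conventionally the weighted sum of risk-allele dosages across variants, with per-variant effect sizes taken from GWAS summary statistics. A minimal sketch of that definition (names are illustrative; the paper's fine-tuning workflow adjusts which variants and weights are used, not this sum itself):

```python
def polygenic_score(dosages, effect_sizes):
    """Weighted sum of risk-allele dosages (0, 1, or 2 copies per variant)."""
    if len(dosages) != len(effect_sizes):
        raise ValueError("one effect size is required per variant")
    return sum(d * b for d, b in zip(dosages, effect_sizes))
```

Adapting a PRS to a new ancestry typically means re-estimating or re-weighting the `effect_sizes` on target-ancestry data, which is the gap the 1KVG-based fine-tuning addresses.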
{"title":"Polygenic risk scores adaptation for Height in a Vietnamese population","authors":"Trang T. H. Tran, Mai H. Tran, D. T. Nguyen, Tien M. Pham, G. Vu, N. S. Vo, Nam N. Nguyen, Quang T. Vu","doi":"10.1109/KSE56063.2022.9953620","DOIUrl":"https://doi.org/10.1109/KSE56063.2022.9953620","url":null,"abstract":"Genome-wide association studies (GWAS) with millions of genetic markers have proven to be useful for precision medicine applications as means of advanced calculation to provide a Polygenic risk score (PRS). However, the potential for interpretation and application of existing PRS models has limited transferability across ancestry groups due to the historical bias of GWAS toward European ancestry. Here we propose an adapted workflow to fine-tune the baseline PRS model to the dataset of target ancestry. We use the dataset of Vietnamese whole genomes from the 1KVG project and build a PRS model of height prediction for the Vietnamese population. Our best-fit model achieved an increase in R2 of 0.152 (according to 29.8%) compared to the null model, which only consists of the metadata.","PeriodicalId":330865,"journal":{"name":"2022 14th International Conference on Knowledge and Systems Engineering (KSE)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114586466","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
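At its core a polygenic score is a weighted sum over risk variants, and the incremental R2 quoted above compares a model including the PRS against the covariate-only null model. A minimal sketch (the effect sizes and dosages are made-up illustrative numbers, not values from the 1KVG study):

```python
def polygenic_score(dosages, betas):
    """PRS = sum over variants of (allele dosage 0/1/2) x (GWAS effect size)."""
    assert len(dosages) == len(betas)
    return sum(d * b for d, b in zip(dosages, betas))

def r_squared(y, y_pred):
    """Coefficient of determination for a set of phenotype predictions."""
    mean_y = sum(y) / len(y)
    ss_tot = sum((v - mean_y) ** 2 for v in y)
    ss_res = sum((v - p) ** 2 for v, p in zip(y, y_pred))
    return 1.0 - ss_res / ss_tot

# Illustrative effect sizes (cm per allele) and one individual's dosages.
betas = [0.4, -0.2, 0.1]
dosages = [2, 1, 0]
prs = polygenic_score(dosages, betas)  # 2*0.4 + 1*(-0.2) + 0*0.1 = 0.6
```

The incremental R2 reported in the abstract would then be `r_squared` of the full model (covariates + PRS) minus `r_squared` of the null model fitted on covariates alone.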
TarGAN: CT to MRI Translation Using Private Unpaired Data Domain
Pub Date : 2022-10-19 DOI: 10.1109/KSE56063.2022.9953622
K. Truong, T. Le
The detection and treatment of cancer and other disorders depend on magnetic resonance imaging (MRI) and computed tomography (CT) scans. Compared to CT scans, MRI scans provide sharper pictures. An MRI is preferable to an X-ray or CT scan when the doctor needs to observe soft tissues. Moreover, MRI scans of organs and soft tissues, such as damaged ligaments and herniated discs, can be more accurate than CT imaging. However, capturing an MRI typically takes longer than a CT scan, and MRI is substantially more expensive because it requires more sophisticated equipment. As a result, it is challenging to gather enough MRI scans for training medical image segmentation models. To address this issue, we suggest using a deep learning network (TarGAN) to reconstruct MRI from CT scans. The generated MRI images can then be used to enrich training data for MRI image segmentation tasks.
{"title":"TarGAN: CT to MRI Translation Using Private Unpaired Data Domain","authors":"K. Truong, T. Le","doi":"10.1109/KSE56063.2022.9953622","DOIUrl":"https://doi.org/10.1109/KSE56063.2022.9953622","url":null,"abstract":"The detection and treatment of cancer and other disorders depend on the use of magnetic resonance imaging (MRI) and computed tomography (CT) scans. Compared to CT scan, MRI scans provide sharper pictures. An MRI is preferable to an X-ray or CT scan when the doctor needs to observe the soft tissues. Besides, MRI scans of organs and soft tissues, such as damaged ligaments and herniated discs, can be more accurate than CT imaging. However, capturing MRI typically takes longer than CT. Furthermore, MRI is substantially more expensive than CT because it requires more sophisticated current equipment. As a result, it is challenging to gather MRI scans to help with the medical image segmentation training issue. To address the aforementioned issue, we suggest using a deep learning network (TarGAN) to reconstruct MRI from CT scans. These created MRI images can then be used to enrich training data for MRI images segmentation issues.","PeriodicalId":330865,"journal":{"name":"2022 14th International Conference on Knowledge and Systems Engineering (KSE)","volume":"1225 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129066418","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
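Because the CT and MRI data are unpaired, a pixel-wise loss against a matched target image is unavailable; GAN-based translators of this kind typically constrain the mapping with a cycle-consistency term instead: translating CT → MRI → CT should reconstruct the input. A toy sketch with elementwise stand-in "generators" (the real model uses convolutional networks; everything here is illustrative):

```python
def l1_loss(a, b):
    """Mean absolute error between two flattened images."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def cycle_consistency_loss(ct_image, g_ct2mri, g_mri2ct):
    """|| G_mri2ct(G_ct2mri(ct)) - ct ||_1, the CT-side cycle term."""
    fake_mri = g_ct2mri(ct_image)
    reconstructed = g_mri2ct(fake_mri)
    return l1_loss(ct_image, reconstructed)

# Toy generators that are exact inverses, so the cycle loss is zero;
# during training this term pushes the learned generators toward that state.
ct2mri = lambda img: [2.0 * p for p in img]
mri2ct = lambda img: [p / 2.0 for p in img]
```

In a full training loop this term is added to the adversarial losses of the two discriminators, which is what lets the translation be learned from unpaired CT and MRI sets.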