
Machine Learning and Knowledge Extraction: Latest Publications

Knowledge Graph Extraction of Business Interactions from News Text for Business Networking Analysis
Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2024-01-07 DOI: 10.3390/make6010007
Didier Gohourou, Kazuhiro Kuwabara
Network representation of data is key to a variety of fields and their applications including trading and business. A major source of data that can be used to build insightful networks is the abundant amount of unstructured text data available through the web. The efforts to turn unstructured text data into a network have spawned different research endeavors, including the simplification of the process. This study presents the design and implementation of TraCER, a pipeline that turns unstructured text data into a graph, targeting the business networking domain. It describes the application of natural language processing techniques used to process the text, as well as the heuristics and learning algorithms that categorize the nodes and the links. The study also presents some simple yet efficient methods for the entity-linking and relation classification steps of the pipeline.
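The overall shape of such a text-to-graph pipeline can be sketched in a few lines. This is a minimal illustration with hand-written regex rules and invented company names; TraCER itself applies NLP models, heuristics, and learning algorithms for the entity-linking and relation classification steps.

```python
import re
from collections import defaultdict

# Hypothetical interaction patterns; illustrative only.
PATTERNS = [
    (re.compile(r"(\w[\w ]*?) acquired (\w[\w ]*)"), "acquisition"),
    (re.compile(r"(\w[\w ]*?) partnered with (\w[\w ]*)"), "partnership"),
]

def extract_graph(sentences):
    """Turn raw sentences into an edge-labeled business graph."""
    graph = defaultdict(list)  # node -> list of (relation, node)
    for s in sentences:
        for pattern, relation in PATTERNS:
            m = pattern.search(s)
            if m:
                source, target = m.group(1).strip(), m.group(2).strip()
                graph[source].append((relation, target))
    return dict(graph)

news = [
    "AcmeCorp acquired BetaSoft",
    "BetaSoft partnered with GammaLabs",
]
g = extract_graph(news)
```

The resulting adjacency structure maps each company node to its labeled outgoing links, which is the form a business networking analysis would consume.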
Citations: 0
A Data Mining Approach for Health Transport Demand
Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2024-01-04 DOI: 10.3390/make6010005
Jorge Blanco Prieto, Marina Ferreras González, S. Van Vaerenbergh, Oscar Jesús Cosido Cobos
Efficient planning and management of health transport services are crucial for improving accessibility and enhancing the quality of healthcare. This study focuses on the choice of determinant variables in the prediction of health transport demand using data mining and analysis techniques. Specifically, health transport services data from Asturias, spanning a seven-year period, are analyzed with the aim of developing accurate predictive models. The problem at hand requires the handling of large volumes of data and multiple predictor variables, leading to challenges in computational cost and interpretation of the results. Therefore, data mining techniques are applied to identify the most relevant variables in the design of predictive models. This approach allows for reducing the computational cost without sacrificing prediction accuracy. The findings of this study underscore that the selection of significant variables is essential for optimizing medical transport resources and improving the planning of emergency services. With the most relevant variables identified, a balance between prediction accuracy and computational efficiency is achieved. As a result, improved service management is observed to lead to increased accessibility to health services and better resource planning.
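As a rough illustration of this kind of variable selection, candidate predictors can be ranked by the strength of their correlation with observed demand and only the top few retained. The variable names and numbers below are invented placeholders; the study's actual data mining techniques are more involved.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def select_variables(candidates, demand, top_k=2):
    """Rank predictor variables by |correlation| with demand, keep top_k."""
    ranked = sorted(candidates, key=lambda name: -abs(pearson(candidates[name], demand)))
    return ranked[:top_k]

demand = [10, 12, 15, 18, 20]          # toy weekly transport demand
candidates = {
    "population_served": [1.0, 1.2, 1.5, 1.8, 2.0],  # perfectly correlated
    "mean_age":          [40, 41, 40, 42, 41],        # weakly correlated
    "day_of_week":       [1, 5, 3, 2, 4],             # essentially noise
}
best = select_variables(candidates, demand)
```

Keeping only the highest-ranked variables is what reduces the computational cost of the downstream predictive model without sacrificing much accuracy.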
Citations: 0
Machine Learning for an Enhanced Credit Risk Analysis: A Comparative Study of Loan Approval Prediction Models Integrating Mental Health Data
Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2024-01-04 DOI: 10.3390/make6010004
Adnan Alagic, Natasa Zivic, E. Kadusic, Dženan Hamzić, Narcisa Hadzajlic, Mejra Dizdarević, Elmedin Selmanovic
The number of loan requests is rapidly growing worldwide, representing a multi-billion-dollar business in the credit approval industry. Large data volumes extracted from the banking transactions that represent customers’ behavior are available, but processing loan applications is a complex and time-consuming task for banking institutions. In 2022, over 20 million Americans had open loans, totaling USD 178 billion in debt, although over 20% of loan applications were rejected. Numerous statistical methods have been deployed to estimate loan risks, raising the question of whether machine learning techniques can better predict the potential risks. To study the machine learning paradigm in this sector, a mental health dataset and a loan approval dataset presenting survey results from 1991 individuals are used as inputs to experiment with the credit risk prediction ability of the chosen machine learning algorithms. Providing a comprehensive comparative analysis, this paper shows how the chosen machine learning algorithms can distinguish between normal and risky loan customers who might never pay their debts back. The results from the tested algorithms show that XGBoost achieves the highest accuracy of 84% in the first dataset, surpassing gradient boost (83%) and KNN (83%). In the second dataset, random forest achieved the highest accuracy of 85%, followed by decision tree and KNN with 83%. Alongside accuracy, the precision, recall, and overall performance of the algorithms were tested, and a confusion matrix analysis was performed, producing numerical results that emphasized the superior performance of XGBoost and random forest in the classification tasks in the first dataset, and of XGBoost and decision tree in the second dataset. Researchers and practitioners can rely on these findings to inform their model selection process and enhance the accuracy and precision of their classification models.
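The confusion matrix analysis mentioned in the abstract reduces to a few counts over the predictions. A minimal sketch with toy binary labels (not the paper's data):

```python
def confusion_metrics(y_true, y_pred):
    """Compute accuracy, precision, and recall from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}

# 1 = risky applicant, 0 = normal applicant (invented toy labels)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
m = confusion_metrics(y_true, y_pred)
```

Precision and recall together reveal the class-specific errors (risky applicants approved, normal applicants rejected) that a single accuracy figure hides.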
Citations: 0
Predicting Wind Comfort in an Urban Area: A Comparison of a Regression- with a Classification-CNN for General Wind Rose Statistics
Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2024-01-04 DOI: 10.3390/make6010006
Jennifer Werner, Dimitri Nowak, Franziska Hunger, Tomas Johnson, A. Mark, Alexander Gösta, F. Edelvik
Wind comfort is an important factor when new buildings in existing urban areas are planned. It is common practice to use computational fluid dynamics (CFD) simulations to model wind comfort. These simulations are usually time-consuming, making it impossible to explore a high number of different design choices for a new urban development with wind simulations. Data-driven approaches based on simulations have shown great promise, and have recently been used to predict wind comfort in urban areas. These surrogate models could be used in generative design software and would enable the planner to explore a large number of options for a new design. In this paper, we propose a novel machine learning workflow (MLW) for direct wind comfort prediction. The MLW incorporates a regression and a classification U-Net, trained based on CFD simulations. Furthermore, we present an augmentation strategy focusing on generating more training data independent of the underlying wind statistics needed to calculate the wind comfort criterion. We train the models based on different sets of training data and compare the results. All trained models (regression and classification) yield an F1-score greater than 80% and can be combined with any wind rose statistic.
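The claim that the trained models "can be combined with any wind rose statistic" suggests a weighting step of roughly the following form. The per-direction scores and frequencies below are hypothetical; the paper's U-Nets actually predict full spatial fields per wind direction, not single numbers.

```python
def weighted_comfort(per_direction_score, wind_rose):
    """Combine per-direction model outputs with wind rose frequencies.

    per_direction_score: predicted discomfort score per wind direction
    wind_rose: fraction of time the wind blows from each direction
    (both inputs are invented for illustration)
    """
    assert abs(sum(wind_rose.values()) - 1.0) < 1e-9  # frequencies sum to 1
    return sum(wind_rose[d] * per_direction_score[d] for d in wind_rose)

scores = {"N": 0.2, "E": 0.6, "S": 0.1, "W": 0.4}
rose = {"N": 0.4, "E": 0.1, "S": 0.3, "W": 0.2}
site_score = weighted_comfort(scores, rose)
```

Because the wind rose enters only in this final weighting, the same trained per-direction predictions can be reused for any site's wind statistics.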
Citations: 0
An Evaluative Baseline for Sentence-Level Semantic Division
Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2024-01-02 DOI: 10.3390/make6010003
Kuangsheng Cai, Zugang Chen, Hengliang Guo, Shaohua Wang, Guoqing Li, Jing Li, Feng Chen, Hang Feng
Semantic folding theory (SFT) is an emerging cognitive science theory that aims to explain how the human brain processes and organizes semantic information. The distribution of text into semantic grids is key to SFT. We propose a sentence-level semantic division baseline with 100 grids (SSDB-100), the only dataset we are currently aware of that provides a relevant validation of the sentence-level SFT algorithm. We use it to evaluate the validity of text distribution in semantic grids and divide it using classical division algorithms on SSDB-100. In this article, we describe the construction of SSDB-100. First, a semantic division questionnaire with broad coverage was generated by limiting the uncertainty range of the topics and corpus. Subsequently, through an expert survey, 11 human experts provided feedback. Finally, we analyzed and processed the feedback; after eliminating the invalid feedback, the average consistency index for the remaining feedback was 0.856. SSDB-100 has 100 semantic grids with clear distinctions between the grids, allowing the dataset to be extended using semantic methods.
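The abstract does not define the consistency index; one plausible reading is the average pairwise agreement between expert labelings, sketched here with invented grid assignments rather than the paper's actual feedback data.

```python
from itertools import combinations

def pairwise_agreement(labelings):
    """Average fraction of items on which two expert labelings agree."""
    scores = []
    for a, b in combinations(labelings, 2):
        agree = sum(1 for x, y in zip(a, b) if x == y)
        scores.append(agree / len(a))
    return sum(scores) / len(scores)

# Three hypothetical experts assigning 5 sentences to semantic grid IDs
experts = [
    [3, 1, 4, 1, 5],
    [3, 1, 4, 2, 5],
    [3, 2, 4, 1, 5],
]
ci = pairwise_agreement(experts)
```

A threshold on such a score is one way invalid (inconsistent) feedback could be filtered out before averaging, as the abstract describes.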
Citations: 0
Statistical Analysis of Imbalanced Classification with Training Size Variation and Subsampling on Datasets of Research Papers in Biomedical Literature
Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2023-12-11 DOI: 10.3390/make5040095
Jose Dixon, M. Rahman
The overall purpose of this paper is to demonstrate how data preprocessing, training size variation, and subsampling can dynamically change the performance metrics of imbalanced text classification. The methodology combines two supervised learning approaches to feature engineering and data preprocessing, five machine learning classifiers, five imbalanced sampling techniques, specified intervals of training and subsampling sizes, and statistical analysis using R and tidyverse. The data comprise 1000 portable document format files divided into five labels, drawn from the World Health Organization Coronavirus Research Downloadable Articles of COVID-19 papers and from PubMed Central databases of non-COVID-19 papers, with binary classification performance measured by precision, recall, area under the receiver operating characteristic curve, and accuracy. One approach, which labels rows of sentences based on regular expressions, significantly improved the performance of the imbalanced sampling techniques relative to another approach that automatically labels the sentences according to how the documents are organized into positive and negative classes; the improvement was verified by statistical analysis using a t-test on the performance metrics across iterations. The study demonstrates the effectiveness of ML classifiers and sampling techniques in text classification datasets, with different performance levels and class imbalance issues observed in manual and automatic methods of data processing.
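Training size variation with subsampling can be sketched as repeatedly drawing aligned random subsets and checking the class balance of each. The documents and labels below are placeholders, not the paper's corpus.

```python
import random

def subsample(rows, labels, n, seed=0):
    """Draw a random subsample of size n, keeping rows/labels aligned."""
    rng = random.Random(seed)
    idx = rng.sample(range(len(rows)), n)
    return [rows[i] for i in idx], [labels[i] for i in idx]

def class_balance(labels):
    """Fraction of positive labels, a quick check for class imbalance."""
    return sum(labels) / len(labels)

rows = [f"doc_{i}" for i in range(1000)]
labels = [1 if i < 100 else 0 for i in range(1000)]  # 10% positive: imbalanced

for size in (100, 250, 500):
    x_sub, y_sub = subsample(rows, labels, size)
    # class balance of each subsample drifts around the full-corpus 0.1
```

Computing a metric at each training size over many seeded draws produces the per-iteration samples on which a t-test between two labeling approaches can be run.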
Citations: 0
Effective Detection of Epileptic Seizures through EEG Signals Using Deep Learning Approaches
Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2023-12-11 DOI: 10.3390/make5040094
S. Mekruksavanich, A. Jitpattanakul
Epileptic seizures are a prevalent neurological condition that impacts a considerable portion of the global population. Timely and precise identification can result in as many as 70% of individuals achieving freedom from seizures. To achieve this, there is a pressing need for smart, automated systems to assist medical professionals in identifying neurological disorders correctly. Previous efforts have utilized raw electroencephalography (EEG) data and machine learning techniques to classify behaviors in patients with epilepsy. However, these studies required expertise in clinical domains like radiology and clinical procedures for feature extraction. Traditional machine learning for classification relied on manual feature engineering, limiting performance. Deep learning excels at automated feature learning directly from raw data without human effort. For example, deep neural networks now show promise in analyzing raw EEG data to detect seizures, eliminating intensive clinical or engineering needs. Though still emerging, initial studies demonstrate practical applications across medical domains. In this work, we introduce a novel deep residual model called ResNet-BiGRU-ECA, analyzing brain activity through EEG data to accurately identify epileptic seizures. To evaluate our proposed deep learning model’s efficacy, we used a publicly available benchmark dataset on epilepsy. The results of our experiments demonstrated that our suggested model surpassed both the basic model and cutting-edge deep learning models, achieving an outstanding accuracy rate of 0.998 and the top F1-score of 0.998.
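Although the paper's model consumes raw EEG through a deep residual network, a common first step in any EEG pipeline is segmenting the continuous signal into fixed-length, possibly overlapping windows that the classifier then labels. A generic sketch of that windowing step (an assumption about typical preprocessing, not the paper's exact procedure):

```python
def windows(signal, size, step):
    """Segment a raw 1-D signal into fixed-length overlapping windows."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, step)]

eeg = list(range(20))            # stand-in for one EEG channel's samples
segs = windows(eeg, size=8, step=4)  # 8-sample windows, 50% overlap
```

Each window becomes one training example, so a long recording yields many labeled segments for the seizure/non-seizure classifier.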
Citations: 0
Social Intelligence Mining: Unlocking Insights from X
Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2023-12-11 DOI: 10.3390/make5040093
Hossein Hassani, N. Komendantova, Elena Rovenskaya, M. R. Yeganegi
Social trend mining, situated at the confluence of data science and social research, provides a novel lens through which to examine societal dynamics and emerging trends. This paper explores the intricate landscape of social trend mining, with a specific emphasis on discerning leading and lagging trends. Within this context, our study employs social trend mining techniques to scrutinize X (formerly Twitter) data pertaining to risk management, earthquakes, and disasters. A comprehensive comprehension of how individuals perceive the significance of these pivotal facets within disaster risk management is essential for shaping policies that garner public acceptance. This paper sheds light on the intricacies of public sentiment and provides valuable insights for policymakers and researchers alike.
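Distinguishing leading from lagging trends can be illustrated by shifting one time series against another and keeping the shift that matches best. This toy lag search is only an assumption about the general idea behind such analysis, not the paper's method, and the series are invented.

```python
def best_lag(a, b, max_lag=5):
    """Find the shift of series b that best matches series a.

    A positive lag means b trails a (a is the leading trend). The match
    score is a plain average product, a simplified stand-in for proper
    cross-correlation.
    """
    def score(lag):
        pairs = [(a[i], b[i + lag]) for i in range(len(a)) if 0 <= i + lag < len(b)]
        return sum(x * y for x, y in pairs) / len(pairs)
    return max(range(-max_lag, max_lag + 1), key=score)

a = [0, 0, 1, 3, 1, 0, 0, 0]   # e.g. mention volume of topic A
b = [0, 0, 0, 0, 1, 3, 1, 0]   # same spike appearing two steps later
lag = best_lag(a, b)
```

Here topic A's spike precedes topic B's by two steps, so A would be read as the leading trend.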
Citations: 0
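The leading-versus-lagging distinction central to the abstract above can be illustrated with a minimal cross-correlation sketch. This is our own illustration of the general idea, not the authors' method: given two keyword-frequency time series, the lag that maximizes their cross-correlation estimates by how many steps one trend trails the other.

```python
def best_lag(leader, follower, max_lag=10):
    """Estimate how many steps `follower` lags `leader` by picking the
    shift that maximizes the cross-correlation of the mean-centered
    series. A positive result means `follower` trails `leader`."""
    ml = sum(leader) / len(leader)
    mf = sum(follower) / len(follower)
    a = [x - ml for x in leader]
    b = [x - mf for x in follower]
    best_k, best_score = 0, float("-inf")
    for k in range(max_lag + 1):
        # align a[t] with b[t + k] and score the overlap
        score = sum(a[t] * b[t + k] for t in range(len(a) - k))
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```

Applied to, say, daily counts of "earthquake" mentions versus "risk management" mentions, a consistently positive lag would mark the former as the leading trend.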
Solving Partially Observable 3D-Visual Tasks with Visual Radial Basis Function Network and Proximal Policy Optimization 用视觉径向基函数网络和近端策略优化解决部分可观测的三维视觉任务
Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date : 2023-12-01 DOI: 10.3390/make5040091
Julien Hautot, Céline Teulière, Nourddine Azzaoui
Visual Reinforcement Learning (RL) has been extensively investigated in recent decades. Existing approaches are often composed of multiple networks requiring massive computational power to solve partially observable tasks from high-dimensional data such as images. Using State Representation Learning (SRL) has been shown to improve the performance of visual RL by reducing high-dimensional data into a compact representation, but it still often relies on deep networks and on the environment. In contrast, we propose a lighter, more generic method to extract sparse and localized features from raw images without training. We achieve this using a Visual Radial Basis Function Network (VRBFN), which offers significant practical advantages, including efficient and accurate training with minimal complexity due to its two linear layers. For real-world applications, its scalability and resilience to noise are essential, as real sensors are subject to change and noise. Unlike CNNs, which may require extensive retraining, this network might only need minor fine-tuning. We test the efficiency of the VRBFN representation to solve different RL tasks using Proximal Policy Optimization (PPO). We present a large study and comparison of our extraction methods with five classical visual RL and SRL approaches on five different first-person partially observable scenarios. We show that this approach presents appealing features such as sparsity and robustness to noise and that the obtained results when training RL agents are better than other tested methods on four of the five proposed scenarios.
{"title":"Solving Partially Observable 3D-Visual Tasks with Visual Radial Basis Function Network and Proximal Policy Optimization","authors":"Julien Hautot, Céline Teulière, Nourddine Azzaoui","doi":"10.3390/make5040091","DOIUrl":"https://doi.org/10.3390/make5040091","url":null,"abstract":"Visual Reinforcement Learning (RL) has been largely investigated in recent decades. Existing approaches are often composed of multiple networks requiring massive computational power to solve partially observable tasks from high-dimensional data such as images. Using State Representation Learning (SRL) has been shown to improve the performance of visual RL by reducing the high-dimensional data into compact representation, but still often relies on deep networks and on the environment. In contrast, we propose a lighter, more generic method to extract sparse and localized features from raw images without training. We achieve this using a Visual Radial Basis Function Network (VRBFN), which offers significant practical advantages, including efficient and accurate training with minimal complexity due to its two linear layers. For real-world applications, its scalability and resilience to noise are essential, as real sensors are subject to change and noise. Unlike CNNs, which may require extensive retraining, this network might only need minor fine-tuning. We test the efficiency of the VRBFN representation to solve different RL tasks using Proximal Policy Optimization (PPO). We present a large study and comparison of our extraction methods with five classical visual RL and SRL approaches on five different first-person partially observable scenarios. We show that this approach presents appealing features such as sparsity and robustness to noise and that the obtained results when training RL agents are better than other tested methods on four of the five proposed scenarios.","PeriodicalId":93033,"journal":{"name":"Machine learning and knowledge extraction","volume":" 39","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138619199","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
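The sparse, localized, training-free features described in the VRBFN abstract can be pictured with a toy Gaussian radial-basis unit over image coordinates and intensities. This is a hedged sketch of the generic RBF idea only; the unit layout, parameters, and normalization below are our assumptions, not the paper's architecture.

```python
import math

def rbf_features(image, centers, sigma_pos=0.15, sigma_int=0.3):
    """Project a 2D grayscale image (values in [0, 1]) onto fixed
    Gaussian units. Each unit has a normalized spatial center (cx, cy)
    and a preferred intensity ci; its activation is the spatially
    Gaussian-weighted match of nearby pixels to that intensity, so each
    unit responds only to its local region (a localized feature)."""
    h, w = len(image), len(image[0])
    feats = []
    for cx, cy, ci in centers:
        acc, norm = 0.0, 0.0
        for y in range(h):
            for x in range(w):
                dx, dy = x / w - cx, y / h - cy
                wpos = math.exp(-(dx * dx + dy * dy) / (2 * sigma_pos ** 2))
                di = image[y][x] - ci
                acc += wpos * math.exp(-(di * di) / (2 * sigma_int ** 2))
                norm += wpos
        feats.append(acc / norm)
    return feats
```

A unit whose preferred intensity matches its receptive region activates near 1, while mismatched units stay near 0, which is what makes the resulting feature vector sparse.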
Bayesian Network Structural Learning Using Adaptive Genetic Algorithm with Varying Population Size 利用种群规模变化的自适应遗传算法进行贝叶斯网络结构学习
Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date : 2023-12-01 DOI: 10.3390/make5040090
Rafael Rodrigues Mendes Ribeiro, Carlos Dias Maciel
A Bayesian network (BN) is a probabilistic graphical model that can model complex and nonlinear relationships. Its structural learning from data is an NP-hard problem because of its search-space size. One method to perform structural learning is a search and score approach, which uses a search algorithm and structural score. A study comparing 15 algorithms showed that hill climbing (HC) and tabu search (TABU) performed the best overall on the tests. This work performs a deeper analysis of the application of the adaptive genetic algorithm with varying population size (AGAVaPS) on the BN structural learning problem, which a preliminary test suggested it could perform well on. AGAVaPS is a genetic algorithm that uses the concept of life, where each solution is in the population for a number of iterations. Each individual also has its own mutation rate, and there is a small probability of undergoing mutation twice. Parameter analysis of AGAVaPS in BN structural learning was performed. Also, AGAVaPS was compared to HC and TABU for six literature datasets considering F1 score, structural Hamming distance (SHD), balanced scoring function (BSF), Bayesian information criterion (BIC), and execution time. HC and TABU performed basically the same for all the tests made. AGAVaPS performed better than the other algorithms for F1 score, SHD, and BIC, showing that it can perform well and is a good choice for BN structural learning.
{"title":"Bayesian Network Structural Learning Using Adaptive Genetic Algorithm with Varying Population Size","authors":"Rafael Rodrigues Mendes Ribeiro, Carlos Dias Maciel","doi":"10.3390/make5040090","DOIUrl":"https://doi.org/10.3390/make5040090","url":null,"abstract":"A Bayesian network (BN) is a probabilistic graphical model that can model complex and nonlinear relationships. Its structural learning from data is an NP-hard problem because of its search-space size. One method to perform structural learning is a search and score approach, which uses a search algorithm and structural score. A study comparing 15 algorithms showed that hill climbing (HC) and tabu search (TABU) performed the best overall on the tests. This work performs a deeper analysis of the application of the adaptive genetic algorithm with varying population size (AGAVaPS) on the BN structural learning problem, which a preliminary test showed that it had the potential to perform well on. AGAVaPS is a genetic algorithm that uses the concept of life, where each solution is in the population for a number of iterations. Each individual also has its own mutation rate, and there is a small probability of undergoing mutation twice. Parameter analysis of AGAVaPS in BN structural leaning was performed. Also, AGAVaPS was compared to HC and TABU for six literature datasets considering F1 score, structural Hamming distance (SHD), balanced scoring function (BSF), Bayesian information criterion (BIC), and execution time. HC and TABU performed basically the same for all the tests made. AGAVaPS performed better than the other algorithms for F1 score, SHD, and BIC, showing that it can perform well and is a good choice for BN structural learning.","PeriodicalId":93033,"journal":{"name":"Machine learning and knowledge extraction","volume":"6 16","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138624618","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
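The AGAVaPS ingredients named in the abstract, namely a per-individual lifespan (so the population size varies as individuals expire), a per-individual mutation rate, and a small chance of mutating twice, can be sketched on a toy OneMax objective. This is our reading of the abstract, not the authors' code; selection, crossover, and all parameter values below are illustrative assumptions.

```python
import random

def agavaps_sketch(fitness, length=12, iters=60, seed=1):
    """Toy GA in the spirit of AGAVaPS: each individual carries its own
    lifespan and mutation rate, the population shrinks and grows as
    individuals expire, and offspring occasionally mutate a second time."""
    rng = random.Random(seed)

    def new_ind(bits):
        return {"bits": bits, "life": rng.randint(2, 6),
                "mut": rng.uniform(0.01, 0.2)}

    pop = [new_ind([rng.randint(0, 1) for _ in range(length)])
           for _ in range(20)]
    best = max(pop, key=lambda i: fitness(i["bits"]))
    for _ in range(iters):
        # tournament selection of two parents, then one-point crossover
        parents = [max(rng.sample(pop, min(3, len(pop))),
                       key=lambda i: fitness(i["bits"])) for _ in range(2)]
        cut = rng.randrange(1, length)
        child = new_ind(parents[0]["bits"][:cut] + parents[1]["bits"][cut:])
        # per-individual mutation rate; small chance of a second pass
        for _ in range(2 if rng.random() < 0.1 else 1):
            child["bits"] = [b ^ 1 if rng.random() < child["mut"] else b
                             for b in child["bits"]]
        pop.append(child)
        for ind in pop:          # age everyone by one iteration
            ind["life"] -= 1
        pop = [i for i in pop if i["life"] > 0]
        if not pop:              # never let the population die out
            child["life"] = rng.randint(2, 6)
            pop = [child]
        best = max([best] + pop, key=lambda i: fitness(i["bits"]))
    return best["bits"]

# OneMax toy objective: the sketch should find a string rich in ones
result = agavaps_sketch(lambda bits: sum(bits))
```

In the paper's actual setting the fitness would be a structural score such as BIC evaluated on a candidate BN graph rather than this bit-counting stand-in.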