2021 the 5th International Conference on Information System and Data Mining: Latest Papers

On Biclique Connectivity in Bipartite Graphs and Recommendation Systems

2021 the 5th International Conference on Information System and Data Mining

Pub Date : 2021-05-27 DOI: 10.1145/3471287.3471302

Cristina Maier, D. Simovici

Bipartite graphs can be used to model many real-world relationships, with applications in many domains such as medicine and social networks. We present an application of maximal bicliques of bipartite graphs to recommendation systems that makes use of the notion of biclique similarity of a set of vertices in order to recommend items to users in a certain order of preference. Experimental results using real-world datasets that justify our approach are presented.
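
To make the notion of a maximal biclique concrete, the following is a minimal brute-force sketch on a tiny user-item graph. It is not the authors' algorithm, and it does not implement their biclique similarity measure, which the abstract does not define. The function name `maximal_bicliques` and the toy data are hypothetical.

```python
from itertools import combinations


def maximal_bicliques(likes):
    """Enumerate maximal bicliques (users, items) of a small bipartite graph.

    `likes` maps each user to the set of items adjacent to that user.
    Brute force over user subsets: exponential time, for illustration only.
    """
    users = sorted(likes)
    found = set()
    for size in range(1, len(users) + 1):
        for group in combinations(users, size):
            # Items adjacent to every user in the group.
            common = set.intersection(*(likes[u] for u in group))
            if not common:
                continue
            # Close the user side: every user adjacent to all common items.
            closed = frozenset(u for u in users if common <= likes[u])
            found.add((closed, frozenset(common)))
    return found


# Hypothetical toy data: three users and the items each one liked.
toy = {"ann": {"a", "b"}, "bob": {"a", "b", "c"}, "cat": {"c"}}
for user_side, item_side in sorted(maximal_bicliques(toy), key=str):
    print(sorted(user_side), sorted(item_side))
```

A recommender built on this idea could, for example, suggest to a user the items that appear in bicliques shared with similar users, but how candidates are ranked is specific to the paper.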

Citations: 2

Actor-Critic Neural Network Based Finite-time Control for Uncertain Robotic Systems

2021 the 5th International Conference on Information System and Data Mining

Pub Date : 2021-05-27 DOI: 10.1145/3471287.3471288

Changyi Lei

This paper investigates reinforcement learning (RL) based finite-time control (FTC) of uncertain robotic systems. The proposed methodology consists of a terminal sliding mode based finite-time controller and an Actor-Critic (AC) based RL loop that adjusts the output of the neural network. The terminal sliding mode controller is designed to ensure a calculable settling time, in contrast to conventional asymptotic stability. The AC-based RL loop uses the recursive least squares technique to update the critic network and a policy gradient algorithm to estimate the parameters of the actor network. We show that the AC loop improves the robustness of the terminal sliding mode controller both in the approaching stage and near equilibrium. The performance of the proposed controller is compared with that of a controller using terminal sliding mode control alone. The simulation results show that the proposed controller outperforms the pure terminal sliding mode controller, and that AC is a successful supplement to FTC.

Citations: 0

Rumor Remove Order Strategy on Social Networks

2021 the 5th International Conference on Information System and Data Mining

Pub Date : 2021-05-27 DOI: 10.1145/3471287.3471294

Yuanda Wang, Haibo Wang, Shigang Chen, Ye Xia

Rumors are defined as widely spread talk with no reliable source to back it up. In modern society, rumors spread widely on social networks, and their spread poses great challenges for society. A "fake news" story can rile up your emotions and change your mood. Some rumors can even cause social panic and economic losses. As such, the influence of rumors can be far-reaching and long-lasting. Efficient and intelligent rumor control strategies are necessary to constrain the spread of rumors. Existing rumor control strategies are designed for controlling a single rumor. However, many rumors usually exist on social networks at once, and only a limited number can be removed at a time because of limited detection capacity and CPU performance. Consequently, when dealing with multiple rumors, we should remove them in a certain order. We argue that the order of removal matters because different rumors possess different properties, e.g., acceptance rate and propagation speed. Unfortunately, to the best of our knowledge, there is no prior work on removing multiple rumors or on the order in which to remove them. To this end, this paper proposes two novel rumor control strategies for removing multiple rumors. We also extend the classical Susceptible-Infected-Recovered (SIR) model to simulate the dynamics of rumor propagation in a more practical manner. We evaluate the performance of the strategies. The experiments show that our proposed rumor control strategies clearly outperform the benchmark strategy.
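
For reference, the classical SIR dynamics that the paper extends can be sketched as a simple discrete-time simulation. This shows only the textbook model, not the authors' extension or their removal strategies. The function name `simulate_sir` and the parameter values are hypothetical, and treating the rate `beta` as a stand-in for a rumor's acceptance rate is an assumption made for illustration.

```python
def simulate_sir(s0, i0, r0, beta, gamma, steps, dt=0.1):
    """Forward-Euler simulation of the classical SIR model.

    beta  : contact/acceptance rate (assumed stand-in for how readily a rumor is believed)
    gamma : recovery rate (how quickly spreaders stop spreading)
    Returns a list of (S, I, R) tuples, one per time step, including the start.
    """
    s, i, r = float(s0), float(i0), float(r0)
    n = s + i + r  # total population, conserved by the model
    history = [(s, i, r)]
    for _ in range(steps):
        new_infected = beta * s * i / n * dt
        new_recovered = gamma * i * dt
        s -= new_infected
        i += new_infected - new_recovered
        r += new_recovered
        history.append((s, i, r))
    return history


# Two hypothetical rumors that differ only in acceptance rate.
slow = simulate_sir(999, 1, 0, beta=0.3, gamma=0.1, steps=1000)
fast = simulate_sir(999, 1, 0, beta=0.6, gamma=0.1, steps=1000)
print("peak spreaders, slow rumor:", round(max(i for _, i, _ in slow)))
print("peak spreaders, fast rumor:", round(max(i for _, i, _ in fast)))
```

Comparing the two runs gives some intuition for why removal order could matter: the rumor with the higher rate reaches far more people before it dies out.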

Citations: 0

SF-U-Net: Using Accurate Shape Estimation and Feature Restoration to Improve Retinal Vessel Segmentation

2021 the 5th International Conference on Information System and Data Mining

Pub Date : 2021-05-27 DOI: 10.1145/3471287.3471377

Wen-Chun Yang

The features of retinal blood vessels are essential indicators that help doctors judge and diagnose eye diseases. These features can also serve as indicators when examining for hypertension, coronary heart disease, and diabetes. However, retinal blood vessels are often very small and complex in distribution, which makes segmenting them difficult for doctors. Although deep learning methods represented by U-Net have performed very well on retinal blood vessel image segmentation in recent years, this difficulty has still not been effectively resolved. To improve segmentation accuracy and address this difficulty, we propose a network called SF-U-Net, which uses accurate shape estimation and feature restoration. We follow the structure of Fully Convolutional Networks (FCN) and the skip connections of U-Net, and use deformable convolution to accurately capture the shape of blood vessels when extracting features at the encoding layer, overcoming the problem of complex blood vessel distribution. At the decoding layer, we adopt a novel dual-stream up-sampling method to achieve accurate feature restoration. Experimental results show that SF-U-Net markedly improves the segmentation of retinal blood vessels. In the experiments, we use two fundus image datasets, DRIVE and CHASE-DB1, and the results on multiple indicators significantly surpass those of other deep learning methods. On the DRIVE dataset, the results of the SF-U-Net model on a variety of indicators exceed those of the currently most advanced methods. The mean accuracy is 0.9602, the area under the curve (AUC) is 0.9848, and the sensitivity is 0.8567.
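
As a reminder of how two of the reported indicators are defined, the sketch below computes accuracy and sensitivity from a predicted and a ground-truth binary vessel mask, flattened to lists of 0/1 pixels. The function name `segmentation_scores` and the toy masks are hypothetical and unrelated to the paper's data. AUC is omitted because it needs per-pixel probabilities, not hard labels.

```python
def segmentation_scores(pred, truth):
    """Return (accuracy, sensitivity) for two equal-length 0/1 pixel lists.

    accuracy    = (TP + TN) / all pixels
    sensitivity = TP / (TP + FN), the share of true vessel pixels that were found
    """
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    accuracy = (tp + tn) / len(truth)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, sensitivity


# Hypothetical 8-pixel masks: 1 marks a vessel pixel, 0 marks background.
predicted = [1, 0, 1, 0, 0, 1, 0, 0]
reference = [1, 0, 0, 0, 1, 1, 0, 0]
print(segmentation_scores(predicted, reference))
```

Because vessel pixels are a small minority of a fundus image, accuracy alone can look high even when thin vessels are missed, which is why sensitivity is reported alongside it.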

Citations: 1

Use of Efficient Machine Learning Techniques in the Identification of Patients with Heart Diseases

2021 the 5th International Conference on Information System and Data Mining

Pub Date : 2021-05-27 DOI: 10.1145/3471287.3471297

Pronab Ghosh, S. Azam, Asif Karim, M. Jonkman, Md. Zahid Hasan

Cardiovascular disease has become one of the world's major causes of death. Accurate and timely diagnosis is of crucial importance. We constructed an intelligent diagnostic framework for the prediction of heart disease, using the Cleveland Heart Disease dataset. We used three machine learning approaches, Decision Tree (DT), K-Nearest Neighbor (KNN), and Random Forest (RF), in combination with different sets of features. We applied the three techniques to the full set of features, to a set of ten features selected by the "Pearson's Correlation" technique, and to a set of six features selected by the Relief algorithm. Results were evaluated based on accuracy, precision, sensitivity, and several other indices. The best results were obtained with the combination of the RF classifier and the features selected by Relief, achieving an accuracy of 98.36%. This was further improved by employing a 5-fold Cross Validation (CV) approach, resulting in an accuracy of 99.337%. CCS CONCEPTS • Applied computing • Life and medical sciences • Health informatics
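
One common reading of feature selection by Pearson's correlation is to rank each feature by the absolute value of its correlation with the class label and keep the top k. The sketch below illustrates that reading. The abstract does not say how the authors applied the technique, so this is an assumption. The function names, feature names, and values are hypothetical toy data, not records from the Cleveland dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)


def top_k_features(columns, labels, k):
    """Names of the k features with the largest |correlation| to the labels."""
    scores = {name: abs(pearson(vals, labels)) for name, vals in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy table: four patients, three features, binary diagnosis.
columns = {
    "age": [29, 45, 61, 70],
    "noise": [5, 1, 5, 1],
    "chol": [180, 240, 210, 260],
}
labels = [0, 0, 1, 1]
print(top_k_features(columns, labels, k=2))
```

The selected columns would then be passed to a classifier such as a random forest and evaluated with k-fold cross validation.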

Citations: 17

FOB-Net: A Semantic Segmentation Method Focusing on Boundary Information

2021 the 5th International Conference on Information System and Data Mining

Pub Date : 2021-05-27 DOI: 10.1145/3471287.3471290

Jiayu Wang

In this paper, we introduce FOB-Net, an original neural network for semantic segmentation tasks. The network's innovation is that it pays more attention to boundary information and obtains the supervision information directly. After an extensive study of several networks commonly used in semantic segmentation, we analyzed the various accepted network structures. Finally, we revised an existing network to obtain FOB-Net.

Citations: 1

Ethnicity Based Consumer Buying Behavior Analysis and Prediction on Online Clothing Platforms in Sri Lanka

2021 the 5th International Conference on Information System and Data Mining

Pub Date : 2021-05-27 DOI: 10.1145/3471287.3471291

T. Ginige, K. Mahima

With their busy lives, people nowadays feel uncomfortable going to a shop to buy things and have less time to do so. E-shopping has grown rapidly during the last decade simply because people can purchase items from online platforms 24x7. Online shopping is a process in which consumers buy goods and services from a shop or a seller over the internet. Popular websites such as AliExpress and eBay are good examples. This online shopping concept is now also very popular in Sri Lanka. In this research, the authors concentrate mainly on Sri Lankan online clothing stores and examine online consumers' buying behavior based on ethnicity. The authors chose Sri Lanka because it is a multi-ethnic country. Four main ethnicities live in Sri Lanka: Sinhalese, Tamils, Muslims, and Burghers. This research mainly considers the two main ethnicities, Sinhalese and Tamils. The authors analyze the buying behavior of these two ethnicities, such as clothing types and favorite colors, on online clothing shopping platforms. Moreover, the authors implement classification models that can predict the clothing type and the color of clothes based on consumers' ethnicities. The main beneficiaries of this research would be Sri Lankan online sellers, who would gain a clear understanding of the buying behaviors and expectations of consumers based on their ethnicities. As a result, sellers could improve their sales, and online shoppers could get an attractive and good online shopping experience.

Citations: 1

A Survey of CNN and Facial Recognition Methods in the Age of COVID-19

2021 the 5th International Conference on Information System and Data Mining

Pub Date : 2021-05-27 DOI: 10.1145/3471287.3471292

Adinma Chidumije, Fatima Gowher, Ehsan Kamalinejad, Justine Mercado, Jiwanjot Soni, Jiaofei Zhong

The rising popularity of facial recognition technology has prompted many questions about its application, reliability, safety, and legality. The ability of a machine to identify an individual and their emotions from an image with near-perfect accuracy is a testament to how far artificial intelligence (AI) models have come. This study rigorously analyzes and consolidates several reputable sources to answer the following questions: What is facial recognition? How is data acquired? What is the machine learning process? How does the Convolutional Neural Network (CNN) work? It also explores potential obstructions, such as face masks, that affect the machine's accuracy, as well as the technology's security vulnerabilities, reliability, and legal concerns.

Citations: 0

Predicting Water Quality Parameters in Lake Pontchartrain using Machine Learning: A comparison on K-Nearest Neighbors, Decision Trees, and Neural Networks to Predict Water Quality

2021 the 5th International Conference on Information System and Data Mining

Pub Date : 2021-05-27 DOI: 10.1145/3471287.3471308

A. Daniels, C. Koutsougeras

This work is about the use of machine learning methods to improve the monitoring of water quality. The work aims to use machine learning to predict the normal values of a quality indicator (pH, salinity, etc.). When actual measurements deviate significantly from the predicted values, monitoring scientists would be alerted to the need to inspect the waterway more closely, thereby reducing the possibility of missing a problem and speeding up the determination of issues regarding water quality. This study compares methods for predicting water quality parameters using water data from Lake Pontchartrain in Southeast Louisiana. K-nearest neighbors, decision trees, and an artificial neural network were used to determine which method most accurately predicted water quality parameters such as pH, temperature, salinity, specific conductance, and dissolved oxygen. The decision tree and k-nearest neighbors algorithms produced similar results, which were only slightly below the standard deviation of the data. However, a neural network was able to predict the values with much higher accuracy.
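
The monitoring idea described above, predicting a normal value and raising an alert when a measurement deviates from it, can be sketched with a small k-nearest neighbors regressor. This is an illustrative sketch only. The function names, the choice of temperature and salinity as input features, the tolerance, and all readings are hypothetical and are not taken from the Lake Pontchartrain data.

```python
from math import dist


def knn_predict(history, query, k=3):
    """Predict a target as the mean target of the k nearest historical samples.

    `history` is a list of (features, target) pairs; `query` is a feature tuple.
    """
    nearest = sorted(history, key=lambda row: dist(row[0], query))[:k]
    return sum(target for _, target in nearest) / len(nearest)


def needs_inspection(predicted, measured, tolerance):
    """Flag a measurement that deviates from the prediction by more than tolerance."""
    return abs(measured - predicted) > tolerance


# Hypothetical past readings: (temperature in deg C, salinity in ppt) -> pH.
history = [
    ((20.0, 3.0), 7.8),
    ((21.0, 3.2), 7.9),
    ((22.0, 3.1), 8.0),
    ((30.0, 6.0), 8.6),
]
expected_ph = knn_predict(history, (21.0, 3.1), k=3)
print("expected pH:", round(expected_ph, 2))
print("alert for measured 6.9:", needs_inspection(expected_ph, 6.9, tolerance=0.5))
```

In practice the features would need to be scaled so that one unit of temperature and one unit of salinity contribute comparably to the distance.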

Citations: 0

Data Mining Techniques in Direct Marketing on Imbalanced Data using Tomek Link Combined with Random Under-sampling

2021 the 5th International Conference on Information System and Data Mining

Pub Date : 2021-05-27 DOI: 10.1145/3471287.3471299

Ümit Yılmaz, C. Gezer, Z. Aydın, V. C. Gungor

Determining potential customers is very important in direct marketing, and data mining techniques are among the most important methods companies use to do so. However, since the number of potential customers is very low compared to the number of non-potential customers, there is a class imbalance problem that significantly affects the performance of data mining techniques. In this paper, different combinations of basic and advanced resampling techniques, such as Synthetic Minority Oversampling Technique (SMOTE), Tomek Link, Random Under-Sampling (RUS), and Random Over-Sampling (ROS), were evaluated to improve the performance of customer classification. Different feature selection techniques, such as Information Gain, Gain Ratio, Chi-squared, and Relief, were used to reduce the number of non-informative features in the data. Classification performance was compared across several data mining techniques, namely LightGBM, XGBoost, Gradient Boost, Random Forest, AdaBoost, ANN, Logistic Regression, Decision Trees, SVC, and Bagging Classifier, based on ROC AUC and sensitivity metrics. The combination of Tomek Link and Random Under-Sampling as the resampling technique with the Chi-squared method as the feature selection algorithm outperformed the other combinations. Detailed performance evaluations demonstrated that, with the proposed approach, LightGBM, a gradient boosting algorithm based on decision trees, gave the best results among the classifiers, with 0.947 sensitivity and a ROC AUC value of 0.896.
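
The resampling combination named above can be illustrated with a small dependency-free sketch. A Tomek link is a pair of samples from different classes that are each other's nearest neighbor. A common convention, assumed here, is to drop the majority-class member of each link and then randomly under-sample the remaining majority samples. The abstract does not state the authors' target class ratio, so the 1:1 ratio below, the function names, and the toy points are all assumptions.

```python
import random
from math import dist


def tomek_link_majority_indices(X, y, majority):
    """Indices of majority samples that form a Tomek link with another class."""
    def nearest(i):
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: dist(X[i], X[j]))

    nn = [nearest(i) for i in range(len(X))]
    # Mutual nearest neighbors with different labels form a Tomek link.
    return {i for i, j in enumerate(nn)
            if nn[j] == i and y[i] != y[j] and y[i] == majority}


def tomek_then_rus(X, y, majority, seed=0):
    """Remove Tomek-link majority samples, then under-sample the majority to 1:1."""
    drop = tomek_link_majority_indices(X, y, majority)
    keep = [i for i in range(len(X)) if i not in drop]
    minority_idx = [i for i in keep if y[i] != majority]
    majority_idx = [i for i in keep if y[i] == majority]
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n_keep = min(len(majority_idx), len(minority_idx))
    chosen = sorted(minority_idx + rng.sample(majority_idx, n_keep))
    return [X[i] for i in chosen], [y[i] for i in chosen]


# Hypothetical 2-D customers: label 1 = potential customer (minority class).
X = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1), (1.0, 1.0), (1.1, 1.0),
     (5.0, 5.0), (5.05, 5.0), (9.0, 9.0)]
y = [0, 0, 0, 0, 0, 1, 0, 1]
X_res, y_res = tomek_then_rus(X, y, majority=0)
print(len(X_res), y_res.count(0), y_res.count(1))
```

The brute-force neighbor search is quadratic in the number of samples. For real datasets, a library implementation such as the one in imbalanced-learn would be the practical choice.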

Citations: 0
