Journal of Data Analysis and Information Processing: Latest Articles, Page 5

Predicting Stock Movement Using Sentiment Analysis of Twitter Feed with Neural Networks

Journal of Data Analysis and Information Processing

Pub Date : 2020-09-29 DOI: 10.4236/jdaip.2020.84018

Sai Vikram Kolasani, Rida Assaf

External factors, such as social media and financial news, can have widespread effects on stock price movement. For this reason, social media is considered a useful resource for precise market predictions. In this paper, we show the effectiveness of using Twitter posts to predict stock prices. We start by training various models on the Sentiment140 Twitter data. We found that Support Vector Machines (SVM) performed best (0.83 accuracy) in the sentiment analysis, so we used an SVM to predict the average sentiment of tweets for each day that the market was open. Next, we use the sentiment analysis of one year of tweets containing the keywords “stock market”, “stocktwits”, and “AAPL”, with the goal of predicting the corresponding stock prices of Apple Inc. (AAPL) and the US Dow Jones Industrial Average (DJIA) index. Two models, Boosted Regression Trees and Multilayer Perceptron Neural Networks, were used to predict the closing price difference of AAPL and DJIA. We show that neural networks perform substantially better than traditional models for stock price prediction.
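The aggregation step described in this abstract, averaging tweet sentiment for each day the market was open and pairing it with a closing price difference, can be sketched as follows. This is a minimal Python illustration and not the authors' code: the function names, the 0/1 sentiment scores, and the day-over-day definition of the price difference are assumptions, and the sample values are invented.

```python
from collections import defaultdict
from datetime import date


def daily_mean_sentiment(scored_tweets, trading_days):
    """Average per-tweet sentiment scores for each day the market was open.

    scored_tweets: iterable of (day, score) pairs, e.g. score 1 = positive, 0 = negative.
    trading_days:  set of days on which the market was open (hypothetical input).
    """
    buckets = defaultdict(list)
    for day, score in scored_tweets:
        if day in trading_days:  # tweets posted on closed-market days are dropped
            buckets[day].append(score)
    return {day: sum(s) / len(s) for day, s in sorted(buckets.items())}


def closing_price_differences(closes):
    """Map each trading day after the first to close[t] - close[t-1] (assumed target)."""
    days = sorted(closes)
    return {d: closes[d] - closes[p] for p, d in zip(days, days[1:])}


if __name__ == "__main__":
    # Invented closing prices and classifier outputs, for illustration only.
    closes = {date(2020, 1, 2): 100.0, date(2020, 1, 3): 101.5, date(2020, 1, 6): 99.5}
    tweets = [(date(2020, 1, 2), 1), (date(2020, 1, 2), 0),
              (date(2020, 1, 3), 1), (date(2020, 1, 4), 0)]
    print(daily_mean_sentiment(tweets, set(closes)))
    print(closing_price_differences(closes))
```

The two dictionaries share trading days as keys, so they can be joined into feature and target pairs for whichever regression model is being compared.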

Citations: 17

Identifying Extreme Rainfall Events Using Functional Outliers Detection Methods

Journal of Data Analysis and Information Processing

Pub Date : 2020-09-29 DOI: 10.4236/jdaip.2020.84016

M. A. Hael, Y. Yuan

Outlier detection techniques play a vital role in exploring unusual data from extreme events, which have a critical effect on the modeling and forecasting of functional data. Functional methods offer an effective way of identifying outliers graphically, including outliers that might not be visible in the original data plot used in classical analysis. This study’s main objective is to detect extreme rainfall events using functional outlier detection methods based on depth and density functions. In order to identify unusual rainfall variation over long time intervals, this work is based on the average monthly rainfall of the Taiz region from 1998 to 2019. Data were extracted from the Tropical Rainfall Measuring Mission and analyzed with R software. The approaches applied in this study are rainbow plots, the functional highest density region box-plot, and the functional bag-plot. According to the results, the functional density box-plot method proved more effective in detecting outliers than the functional depth bag-plot method. In conclusion, the results showed that rainfall over the Taiz region during the last two decades was influenced by the extreme events of 1999, 2004, 2005, and 2009.

Citations: 1

Injury Analysis Based on Machine Learning in NBA Data

Journal of Data Analysis and Information Processing

Pub Date : 2020-09-29 DOI: 10.4236/jdaip.2020.84017

Wan-Ru Wu

It is well known that injuries have a vital influence on an NBA match and may reverse the result between two teams with a wide disparity in strength. In this article, in order to decrease the uncertainty about injury risk in an upcoming match, we propose a pipeline that gathers data at the player level, including fundamental statistics and performance in the preceding match, and data at the team level, including basic information and the opposing team’s status in the match we predict on. Constrained by limited and extremely unbalanced data, our results showed limited power for injury prediction in general but reasonably good results for injuries to a team’s star player. We also analyze the contribution of the factors to our prediction, which demonstrated that a player’s own performance matters most for their injury. Principal Component Analysis is also applied to reduce the dimension of our data and to show the correlation between different features.

Citations: 6

Causes of Restocking Delays in Absence of Real Time Inventory Tracking of Airtel Airtime

Journal of Data Analysis and Information Processing

Pub Date : 2020-09-29 DOI: 10.4236/jdaip.2020.84019

Eddie Musana, A. H. Basaza-Ejiri

The purpose of this research was to ascertain the causes of restocking delays in a distributor company of Airtel Airtime (AA), in order to justify the benefits of using Real Time Inventory Tracking (R.T.I.T) to mitigate those delays. The study was carried out in 2017 at Private Marketing and Trading Services (PMTS), an authorized distributor of Airtel products, and covered airtime scratch cards and electronic (E-Recharge) airtime among other forms, with a view to encouraging R.T.I.T for other products in telecom companies and other business enterprises. The research takes a qualitative and quantitative approach to obtaining the different categories of restocking delays, in the form of themes and sub-themes, encountered in the distribution supply chain (SC) of AA. It goes on to give an in-depth explanation of the managerial and operational causes of restocking delays with respect to AA. Similarly, fast-moving consumer products and services other than AA require a solution to restocking delays through implementation of the Real Time Inventory Tracking Model (R.T.I.T.M) of AA among distributor companies (DCs). This paper also elaborates on the literature, methodology, and findings of the study. The results were obtained from regression analysis using the Statistical Package for Social Sciences (SPSS), which showed a higher significance of Stock Turnover Period and Airtime Denomination as contributors to restocking delays, whereas messages from the Airtel head office to the distributor made a non-significant contribution, as in Figure 9. The research recommends a model for R.T.I.T in the telecom distribution SC of AA, together with Omnichannel Inventory Management (OIM), as a significant contributor to timely, reliable inventory restocking that promotes higher sales among DCs and retailers through minimized restocking delays. It shows that the forces of demand and supply change over time with the different tastes and preferences of customers. The imbalance in AA stock levels changes at given times due to unforeseen consumer demand experienced by DCs, which is explained by the “Bullwhip Effect” arising from information distortion in the supply chain.

Citations: 0

Clustering Approach for Analyzing the Student’s Efficiency and Performance Based on Data

Journal of Data Analysis and Information Processing

Pub Date : 2020-07-02 DOI: 10.4236/jdaip.2020.83010

Tallal Omar, Abdullah M. Alzahrani, M. Zohdy

The academic community currently faces challenges in analyzing and evaluating the progress of students’ academic performance. In practice, classifying student performance is a scientifically challenging task. Recently, some studies have applied cluster analysis to evaluate students’ results and used statistical techniques to partition their scores according to performance. This approach, however, is not efficient. In this study, we combine two techniques, namely the k-means clustering algorithm and the elbow method, to evaluate student performance. Based on this combination, the analysis and evaluation of students’ progress will be more accurate. In this study, the methodology has been implemented to define diverse models based on the students’ test scores.
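The combination named in this abstract, k-means clustering with an elbow criterion for choosing the number of clusters, can be sketched on one-dimensional test scores as below. This is a simplified Python illustration under stated assumptions, not the authors' implementation: the deterministic initialization, the `min_gain` threshold used to locate the elbow, and the sample scores are all hypothetical choices.

```python
def kmeans_1d(scores, k, iterations=100):
    """Lloyd's algorithm on 1-D data; returns (centroids, inertia).

    Assumes len(scores) >= k.
    """
    data = sorted(scores)
    # Deterministic start: spread the initial centroids across the sorted scores.
    centroids = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: (x - centroids[j]) ** 2)
            clusters[nearest].append(x)
        # An empty cluster keeps its previous centroid.
        updated = [sum(c) / len(c) if c else centroids[j]
                   for j, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    inertia = sum(min((x - c) ** 2 for c in centroids) for x in data)
    return centroids, inertia


def elbow_k(scores, max_k, min_gain=0.1):
    """Return (k, inertias): the smallest k after which one more cluster reduces
    inertia by less than min_gain of the single-cluster inertia."""
    inertias = [kmeans_1d(scores, k)[1] for k in range(1, max_k + 1)]
    if inertias[0] == 0:  # all scores identical: one cluster is enough
        return 1, inertias
    for k in range(1, max_k):
        gain = (inertias[k - 1] - inertias[k]) / inertias[0]
        if gain < min_gain:
            return k, inertias
    return max_k, inertias


if __name__ == "__main__":
    scores = [52, 55, 58, 70, 72, 74, 90, 93, 95]  # invented test scores
    k, inertias = elbow_k(scores, max_k=5)
    print(k, kmeans_1d(scores, k)[0])
```

The threshold rule is only one way to read an elbow from the inertia curve; in practice the curve is often inspected visually, and the threshold would need tuning for real score distributions.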

Citations: 8

Hierarchical Representations Feature Deep Learning for Face Recognition

Journal of Data Analysis and Information Processing

Pub Date : 2020-07-02 DOI: 10.4236/jdaip.2020.83012

Haijun Zhang, Yinghui Chen

Most modern face recognition and classification systems rely mainly on hand-crafted image feature descriptors. In this paper, we propose a novel deep learning algorithm combining unsupervised and supervised learning, named deep belief network embedded with Softmax regression (DBNESR), as a natural source for obtaining additional, complementary hierarchical representations, which relieves us from the complicated hand-crafted feature-design step. DBNESR first learns hierarchical representations of features by greedy layer-wise unsupervised learning in a feed-forward (bottom-up) and back-forward (top-down) manner, and then performs more efficient recognition with Softmax regression by supervised learning. For comparison with algorithms based only on supervised learning, we also propose and design several classifiers: BP, HBPNNs, RBF, HRBFNNs, SVM, and a multiple classification decision fusion classifier (MCDFC), which is a hybrid HBPNNs-HRBFNNs-SVM classifier. The experiments show that: first, the proposed DBNESR is optimal for face recognition, with the highest and most stable recognition rates; second, the algorithm combining unsupervised and supervised learning performs better than all of the purely supervised learning algorithms; third, hybrid neural networks perform better than single-model neural networks; fourth, ordered from largest to smallest, the average recognition rates of these algorithms are DBNESR, MCDFC, SVM, HRBFNNs, RBF, HBPNNs, BP, and their variances are BP, RBF, HBPNNs, HRBFNNs, SVM, MCDFC, DBNESR; finally, the hierarchical feature representations learned by DBNESR reflect its capability for modeling hard artificial intelligence tasks.

Citations: 1

Meta-Learning of Evolutionary Strategy for Stock Trading

Journal of Data Analysis and Information Processing

Pub Date : 2020-04-09 DOI: 10.4236/jdaip.2020.82005

Erik Sorensen, Ryan Ozzello, Rachael Rogan, Ethan Baker, N. Parks, Wei Hu

Meta-learning algorithms learn about the learning process itself so that they can speed up subsequent similar learning tasks with fewer data and iterations. If achieved, these benefits extend the flexibility of traditional machine learning to areas where only small windows of time or data are available. One such area is stock trading, where the relevance of data decreases as time passes, requiring fast results from fewer data points to respond to fast-changing market trends. To the best of our knowledge, we are the first to apply meta-learning algorithms to an evolutionary strategy for stock trading, in order to decrease learning time by using fewer iterations and to achieve higher trading profits with fewer data points. We found that our meta-learning approach to stock trading earns profits similar to those of a purely evolutionary algorithm. However, it requires only 50 iterations during test, versus the thousands typically required without meta-learning, or 50% of the training data during test.

Citations: 3

Comparison of Different Machine Learning Algorithms for the Prediction of Coronary Artery Disease

Journal of Data Analysis and Information Processing

Pub Date : 2020-04-09 DOI: 10.4236/jdaip.2020.82003

Imran Chowdhury Dipto, Tanzila Islam, H. Rahman, A. F. M. Moshiur Rahman

Coronary Artery Disease (CAD) is the leading cause of mortality worldwide. It is a complex heart disease associated with numerous risk factors and a variety of symptoms. During the past decade, CAD has undergone a remarkable evolution. The purpose of this research is to build a prototype system using different machine learning algorithms (models) and to compare their performance in order to identify a suitable model. This paper explores three of the most commonly used machine learning algorithms: Logistic Regression, Support Vector Machine, and Artificial Neural Network. A clinical dataset was used to conduct this research. Performance was evaluated using the confusion matrix, stratified K-fold cross-validation, accuracy, AUC, and ROC, and the accuracy and AUC scores were validated using the K-fold cross-validation technique. The dataset is class-imbalanced, so the SMOTE algorithm was used to balance it, and the performance analysis was carried out on both the original and the balanced data. The results show that the accuracy scores of all the models increased when training on the balanced dataset. Overall, the Artificial Neural Network had the highest accuracy and Logistic Regression the lowest among the trained algorithms.
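The balancing step mentioned in this abstract relies on SMOTE, which creates synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below is a simplified pure-Python illustration of that idea and is not the authors' code or a library implementation: the function name, parameters, and sample points are assumptions.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points from minority-class feature tuples.

    Simplified sketch: requires at least two minority samples; k is the number
    of nearest minority neighbours considered for each chosen sample.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen point (excluding itself).
        others = [p for p in minority if p is not base]
        neighbours = sorted(
            others,
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to mate
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, mate)))
    return synthetic


if __name__ == "__main__":
    # Invented 2-D minority samples; suppose the majority class has 10 samples.
    minority = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
    new_points = smote(minority, n_new=10 - len(minority))
    print(len(minority) + len(new_points), new_points[:2])
```

When oversampling is combined with cross-validation, it is usually applied to the training folds only, so that synthetic points derived from a sample do not leak into the fold used to evaluate it.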

Citations: 13

Characteristics Classification of Mobile Apps on Apple Store Using Clustering

Journal of Data Analysis and Information Processing

Pub Date : 2020-04-09 DOI: 10.4236/jdaip.2020.82004

Boxin Fu

This research is concerned with the user ratings of apps on the Apple Store. Its purpose is to better understand some characteristics of good apps on the Apple Store so that app makers can focus on these traits to maximize their profit. The data for this research were collected from kaggle.com and, according to the dataset description, originally came from the iTunes Search API. Four attributes contribute directly to an app’s user rating: rating_count_tot, rating_count_ver, user_rating and user_rating_ver. The relationship between apps receiving higher ratings and apps receiving lower ratings is analyzed using Exploratory Data Analysis and the data science technique of clustering on their numerical attributes. Each app is represented as a data point; apps with similar rating characteristics are assigned to the same cluster, and the characteristics common to all apps in a cluster are the determining traits of that cluster. Both techniques are implemented using Google Colab and libraries including pandas, numpy, seaborn, and matplotlib. The data reveal a direct correlation between user rating and the number of devices and languages supported, and an inverse correlation between user rating and the size and price of the app. In conclusion, according to the data, free, small apps that many different types of users can use are generally well rated by most users.
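The direct and inverse correlations reported in this abstract can be checked with a Pearson correlation coefficient, sketched below in plain Python. The abstract does not say which correlation measure was used, so Pearson is an assumption here, and the prices, language counts, and ratings are invented values rather than rows from the Kaggle dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


if __name__ == "__main__":
    # Hypothetical values for four apps.
    price = [0.0, 0.99, 2.99, 4.99]
    languages_supported = [20, 10, 5, 1]
    user_rating = [4.5, 4.0, 3.5, 3.0]
    print(pearson(price, user_rating))                # negative: inverse correlation
    print(pearson(languages_supported, user_rating))  # positive: direct correlation
```

A coefficient near +1 or -1 indicates a strong linear relationship, while values near 0 indicate little linear association; it says nothing about causation.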

Citations: 1

Research on Spatial Pattern and Its Industrial Distribution of Commercial Space in Mianyang Based on POI Data

Journal of Data Analysis and Information Processing

Pub Date : 2020-01-14 DOI: 10.4236/jdaip.2020.81002

Dacheng Zheng, Changqiu Li

The rational layout of urban commercial space is conducive to optimizing the allocation of commercial resources within the city. Based on commercial POI (Point of Interest) data for the central district of Mianyang, the characteristics of the urban commercial spatial pattern at different scales are analyzed using Kernel Density Estimation, Getis-Ord, Ripley’s K function, and the Location Entropy method, and the spatial agglomeration characteristics of the various industries in urban commerce are studied. The results show that: 1) The spatial distribution of commercial outlets in downtown Mianyang shows marked characteristics and a multi-center pattern. The hot-spot distribution of commercial outlets based on road grid units is generally consistent with the identified distribution of commercial density centers. 2) A graded commercial scale structure has formed across the central urban area as a whole, and the distribution of commercial network hot spots based on road grid units is generally consistent with the identified distribution of commercial density centers. 3) From the perspective of commercial industries, the “center-periphery” differentiation of urban commercial space is obvious, and different industries show different spatial agglomeration modes. 4) The multi-scale spatial agglomeration of each industry differs: the spatial scale of location choice is larger for comprehensive retail, household appliances, and similar industries, and smaller for textiles, clothing, culture, and sports. 5) There are significant differences in specialized functional areas from the perspective of industry: mature areas show multi-functional elements and the agglomeration of multiple advantageous industries, and a small number of developing areas also show the agglomeration of multiple advantageous industries.
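Assuming that the “Location Entropy method” in this abstract corresponds to the location quotient commonly used in POI studies, the measure compares an industry's share of POIs within a zone with its share across the whole study area. The Python sketch below illustrates that calculation only; the zone names, industry names, and counts are invented, and the abstract does not give the authors' exact formula.

```python
def location_quotients(counts):
    """counts[zone][industry] -> POI count; returns LQ for each zone and industry.

    LQ = (industry share within the zone) / (industry share in the whole area).
    Assumes every zone contains at least one POI.
    """
    area_total = sum(sum(zone.values()) for zone in counts.values())
    industry_total = {}
    for zone in counts.values():
        for industry, n in zone.items():
            industry_total[industry] = industry_total.get(industry, 0) + n
    quotients = {}
    for name, zone in counts.items():
        zone_total = sum(zone.values())
        quotients[name] = {
            industry: (n / zone_total) / (industry_total[industry] / area_total)
            for industry, n in zone.items()
        }
    return quotients


if __name__ == "__main__":
    # Hypothetical POI counts for two zones and two industries.
    counts = {
        "zone_a": {"retail": 30, "catering": 10},
        "zone_b": {"retail": 20, "catering": 40},
    }
    for zone, values in location_quotients(counts).items():
        print(zone, values)
```

A quotient above 1 suggests that the industry is more concentrated in that zone than in the study area overall, which is how specialized functional areas are typically identified with this measure.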

Citations: 3