Journal of Data Analysis and Information Processing: Latest Articles

Prediction of Accident Severity Using Artificial Neural Network: A Comparison of Analytical Capabilities between Python and R

Journal of Data Analysis and Information Processing

Pub Date : 2020-01-01 DOI: 10.4236/jdaip.2020.83008

Imran Chowdhury Dipto, A. F. M. Moshiur Rahman, Tanzila Islam, H. Rahman

Organizations generate large amounts of data, and data scientists use a range of analytical tools to handle it. Many tools are available for data processing, visualization, predictive analytics and so on, so it is important to select a suitable analytical tool or programming language for the task. This research compares and contrasts two of the most commonly used programming languages, Python and R. For the experiment, two data sets were collected from Kaggle and combined into a single dataset. The study visualizes the data to generate useful insights and prepares it for training an artificial neural network in both Python and R. The scope of the paper is to compare the analytical capabilities of Python and R. An artificial neural network (a multilayer perceptron) was implemented to predict the severity of accidents. The results were then used to compare the two languages and to indicate which is better suited to data visualization, data processing, predictive analytics, etc.
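
To make the modelling step concrete, the sketch below runs the forward pass of a small multilayer perceptron in plain Python. It is only an illustration: the abstract does not give the network architecture, so the layer sizes, the ReLU and softmax activations, the three severity classes, the feature values and the names `init_params` and `mlp_forward` are all assumptions, and the weights are random rather than trained.

```python
import math
import random

def dense(x, weights, bias):
    """Affine layer: one output per weight row (w . x + b)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, a) for a in v]

def softmax(v):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(v)  # subtract the maximum for numerical stability
    exps = [math.exp(a - top) for a in v]
    total = sum(exps)
    return [e / total for e in exps]

def init_params(n_in, n_hidden, n_out, seed=0):
    """Random, untrained weights for one hidden layer (hypothetical sizes)."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                for _ in range(rows)]

    return {"W1": matrix(n_hidden, n_in), "b1": [0.0] * n_hidden,
            "W2": matrix(n_out, n_hidden), "b2": [0.0] * n_out}

def mlp_forward(x, params):
    """Input -> hidden layer (ReLU) -> class probabilities (softmax)."""
    hidden = relu(dense(x, params["W1"], params["b1"]))
    return softmax(dense(hidden, params["W2"], params["b2"]))

# Hypothetical, already-scaled features for a single accident record.
features = [0.2, 0.7, 0.0, 1.0]
params = init_params(n_in=4, n_hidden=5, n_out=3)
probs = mlp_forward(features, params)
predicted_class = probs.index(max(probs))  # index of the most likely class
```

In practice the weights would be fitted to the prepared Kaggle data by a library in either language; the forward pass above only shows what the fitted network computes for one record.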

Citations: 0

Using Multiple Correspondence Analysis to Measure Multidimensional Poverty in Congo

Journal of Data Analysis and Information Processing

Pub Date : 2020-01-01 DOI: 10.4236/jdaip.2020.84014

Samuel Ambapour

The following analysis is based on a multidimensional understanding of poverty using a non-monetary basic needs approach. It is grounded in data from the first survey on household living conditions for poverty assessment, conducted by the National Institute of Statistics of Congo in 2005. Multiple Correspondence Analysis is applied to construct a composite indicator by aggregating several attributes likely to reflect the poverty of individuals or households. The application shows that Congolese households are not all affected by the same type of poverty. Three types of non-monetary poverty are identified: infrastructure poverty, vulnerability of human existence, and poverty of comfort. The households were then classified according to the composite indicator of poverty. The results show that the incidence of poverty, which corresponds to the weight of the poor class, is about 70.67%.

Citations: 2

Assessment of Hypertension-Induced Deaths in Ghana: A Nation-Wide Study from 2012 to 2016

Journal of Data Analysis and Information Processing

Pub Date : 2020-01-01 DOI: 10.4236/jdaip.2020.83009

D. Adedia, Livingstone Asem, S. Appiah, S. Nanga, Y. Boateng, K. Duedu, L. Anani

Globally, hypertension is one of the leading causes of death. It can lead to heart disease and stroke, among other conditions, which can result in premature death. In Ghana, hypertension is considered a disease that contributes to an increase in outpatient attendance. To assess the trend differentials of hypertension-induced deaths in Ghana, a chi-square test for equal proportions and the Marascuilo procedure for pairwise comparison were performed using surveillance data on the reported number of deaths from 2012 to 2016 across the then ten regions. The results show that the incidence of hypertension-induced mortality differed significantly across almost all the regions and over the years. The incidence of hypertension-induced mortality fell significantly from 2012 to 2016. However, the Volta Region recorded a significantly higher incidence of mortality cases (p-value less than 0.0001) than the other regions during the period under review, while the Upper East Region recorded a continuous increase in the incidence of mortality cases, peaking in 2016. The Eastern Region, Central Region, and Greater Accra Region recorded a significantly (p-value less than 0.0001) higher incidence of hypertension-induced mortality than the Ashanti Region, Brong Ahafo Region, Northern Region, Western Region and Upper West Region. The Upper West Region and Western Region had the lowest incidence of mortality. The declining trend in hypertension-induced mortality could be attributed to healthcare interventions put in place during the period. One of these interventions was the introduction of health insurance in 2003, a development which has been shown to affect people's health-seeking behavior. It is therefore important to investigate the factors affecting these spatial and temporal dynamics in order to determine appropriate ways to actively control hypertension-induced deaths in the country. Public education on health should be intensified so as to totally curb hypertension and its attendant risks.
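
The two tests named in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' analysis: the counts are invented, the use of a population-style denominator for each group is an assumption, and the chi-square critical value is passed in by the caller (5.991 is the standard 5% value for 2 degrees of freedom, i.e. three groups) because the standard library has no chi-square quantile function.

```python
import math
from itertools import combinations

def chi_square_equal_proportions(events, totals):
    """Chi-square statistic for H0: all groups share one proportion."""
    pooled = sum(events) / sum(totals)
    stat = 0.0
    for e, n in zip(events, totals):
        expected_events = n * pooled
        expected_non_events = n * (1 - pooled)
        stat += (e - expected_events) ** 2 / expected_events
        stat += ((n - e) - expected_non_events) ** 2 / expected_non_events
    return stat

def marascuilo(events, totals, chi2_critical):
    """Pairwise comparison of proportions after a significant chi-square test.

    Returns (i, j, |p_i - p_j|, critical_range, significant) for each pair.
    """
    props = [e / n for e, n in zip(events, totals)]
    results = []
    for i, j in combinations(range(len(props)), 2):
        diff = abs(props[i] - props[j])
        critical_range = math.sqrt(chi2_critical) * math.sqrt(
            props[i] * (1 - props[i]) / totals[i]
            + props[j] * (1 - props[j]) / totals[j]
        )
        results.append((i, j, diff, critical_range, diff > critical_range))
    return results

# Hypothetical counts for three groups (not data from the study).
deaths = [30, 60, 90]
population = [1000, 1000, 1000]
statistic = chi_square_equal_proportions(deaths, population)
pairs = marascuilo(deaths, population, chi2_critical=5.991)
```

A pair is declared different when the absolute difference in proportions exceeds its critical range, which is how a single omnibus test is broken down into region-by-region or year-by-year comparisons.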

Citations: 3

Resizable, Rescalable and Free-Style Visualization of Hierarchical Clustering and Bioinformatics Analysis

Journal of Data Analysis and Information Processing

Pub Date : 2020-01-01 DOI: 10.4236/JDAIP.2020.84013

Ruming Li

Graphical representation of hierarchical clustering results is of prime importance in hierarchical cluster analysis of data. Unfortunately, almost all mathematical or statistical software may have only a weak capability for showcasing such clustering results. In particular, most clustering results or trees cannot be drawn as a dendrogram in a resizable, rescalable and free-style fashion. By using “dynamic” rather than “static” drawing, this research works around the weak functionality that keeps clustering results from being visualized in an arbitrary manner. It introduces an algorithmic solution that adopts seamless pixel rearrangements to resize and rescale dendrograms or tree diagrams. The results showed that the algorithm makes the representation of clustering outcomes a truly free visualization for hierarchical clustering and bioinformatics analysis. In particular, it can selectively visualize and/or save results in a specific size, scale and style (different views).

Citations: 2

Spatial Regression Analysis of Pedestrian Crashes Based on Point-of-Interest Data

Journal of Data Analysis and Information Processing

Pub Date : 2020-01-01 DOI: 10.4236/jdaip.2020.81001

Yanyan Chen, Jiajie Ma, Shaohua Wang

Pedestrian safety has recently been regarded as one of the most serious issues in traffic safety research. This study analyzes the spatial correlation between the frequency of pedestrian crashes and various predictor variables based on open-source point-of-interest (POI) data, which can provide specific land use features and user characteristics. Spatial regression models were developed at the Traffic Analysis Zone (TAZ) level using 10,333 pedestrian crash records within the Fifth Ring of Beijing in 2015. Several spatial econometrics approaches were used to examine the spatial autocorrelation in crash count per TAZ, and the spatial heterogeneity was investigated with a geographically weighted regression model. The results showed that the spatial error model performed better than the other two spatial models and a traditional ordinary least squares model. Specifically, bus stops, hospitals, pharmacies, restaurants, and office buildings were positively associated with pedestrian crashes, while hotels were negatively associated with them. In addition, there was significant evidence of localization effects for different POIs. Based on these findings, a number of recommendations and countermeasures can be proposed to improve traffic safety for pedestrians.
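
Spatial autocorrelation in zone-level counts is commonly summarized with global Moran's I. The abstract does not say which statistic the authors used, so the sketch below is an assumed stand-in: it computes Moran's I for hypothetical crash counts in four zones arranged in a line, with a binary adjacency matrix as the spatial weights.

```python
def morans_i(values, weights):
    """Global Moran's I for values observed in n zones.

    weights[i][j] is the spatial weight between zones i and j
    (here 1 for neighbours, 0 otherwise, with a zero diagonal).
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)  # sum of all weights
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(d * d for d in dev)

# Hypothetical crash counts for four zones in a row: 0-1-2-3.
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
smooth_gradient = morans_i([1, 2, 3, 4], adjacency)  # similar neighbours
checkerboard = morans_i([1, 4, 1, 4], adjacency)     # dissimilar neighbours
```

Positive values indicate that neighbouring zones have similar counts and negative values that they alternate; a clearly non-zero value is the usual motivation for preferring a spatial lag or spatial error model over ordinary least squares.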

Citations: 5

The Information Protection in Automatic Reconstruction of Not Continuous Geophysical Data Series

Journal of Data Analysis and Information Processing

Pub Date : 2019-10-15 DOI: 10.4236/jdaip.2019.74013

O. Faggioni

We present a quantitative technique, characterized by low numerical mediation, for reconstructing temporal sequences of geophysical data of length L interrupted for a time ΔT, where […]. The aim is to protect the information acquired before and after the interruption by means of a numerical protocol with the lowest possible computational cost. The signal reconstruction process is based on the synthesis of the low-frequency signal extracted by subsampling (subsampling ∇Dirac = ΔT in phase with ΔT) with the high-frequency signal recorded before the crash. Because of its simplicity, calculation speed and stable spectral response, the SYRec (SYnthetic REConstruction) method is particularly effective in studies of high-speed transient phenomena that develop in strongly perturbed fields. This operating condition is fundamental when almost immediate informational responses are required from the observation system. In this example we deal with geomagnetic data coming from an underwater (uw) counter-intrusion magnetic system. The system produces timely information about the transit of local magnetic singularities (magnetic perturbations with low spatial extension), originated by quasi-point-form kinematic sources (divers), in the underwater magnetic fields of harbors. The stability of the SYRec system also makes it usable for medium- and long-period observation (the activity of geomagnetic observatories).

Citations: 2

Hot Events Detection of Stock Market Based on Time Series Data of Stock and Text Data of Network Public Opinion

Journal of Data Analysis and Information Processing

Pub Date : 2019-09-30 DOI: 10.4236/jdaip.2019.74011

Beibei Cao

With the deep integration of the Internet and the real world, Internet information not only provides real-time, effective data for financial investors but also helps them understand market dynamics and quickly identify financial events that may lead to stock market volatility. However, most research on event detection in the financial field focuses on microblogs, news and other online text, and few scholars have studied the characteristics of financial time series data. Considering that an event in the financial field often affects both the online public opinion space and the real transaction space, this paper proposes a multi-source heterogeneous information detection method, based on stock transaction time series data and online public opinion text data, to detect hot events in the stock market. The method uses an outlier detection algorithm based on multi-member fusion to extract the times of hot events in the stock market. Then, using the feature-item weight formula proposed in the paper, it calculates the keyword weights of online public opinion information to obtain the core content of those hot events. In this way, accurate detection of stock market hot events is achieved.
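
The abstract does not define "multi-member fusion", so the sketch below shows one plausible reading under stated assumptions: three simple outlier detectors (z-score, median absolute deviation, and interquartile range) each vote on every time step, and a step is reported as a candidate hot-event time when at least two agree. The detectors, thresholds, function names and the return series are all hypothetical and are not taken from the paper.

```python
from statistics import mean, median, pstdev, quantiles

def zscore_flags(xs, k=2.0):
    """Flag points more than k standard deviations from the mean."""
    m, s = mean(xs), pstdev(xs)
    return [s > 0 and abs(x - m) > k * s for x in xs]

def mad_flags(xs, k=3.5):
    """Flag points whose modified z-score (based on the MAD) exceeds k."""
    med = median(xs)
    mad = median([abs(x - med) for x in xs])
    return [mad > 0 and 0.6745 * abs(x - med) / mad > k for x in xs]

def iqr_flags(xs, k=1.5):
    """Flag points outside the Tukey fences built from the quartiles."""
    q1, _, q3 = quantiles(xs, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [x < low or x > high for x in xs]

def fused_outliers(xs, min_votes=2):
    """Indices flagged by at least min_votes of the three detectors."""
    votes = zip(zscore_flags(xs), mad_flags(xs), iqr_flags(xs))
    return [i for i, v in enumerate(votes) if sum(v) >= min_votes]

# Hypothetical daily returns (%) with one abnormal day at index 5.
returns = [0.1, -0.2, 0.0, 0.3, -0.1, 5.0, 0.2, -0.3, 0.1, 0.0]
hot_event_days = fused_outliers(returns)
```

Requiring agreement between detectors trades some sensitivity for fewer false alarms; the flagged dates could then be matched against the keywords with the highest weights in that day's online text.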

Citations: 2

Support Vector Machine for Sentiment Analysis of Nigerian Banks Financial Tweets

Journal of Data Analysis and Information Processing

Pub Date : 2019-09-23 DOI: 10.4236/jdaip.2019.74010

F. C. Onwuegbuche, J. Wafula, J. Mung'atu

The rise of social media paves the way for unprecedented benefits or risks to organisations, depending on how they adapt to its changes. This rise brings the great challenge of gaining insights from big data for effective and efficient decision making that can improve quality, profitability, productivity, competitiveness and customer satisfaction. Sentiment analysis is the field concerned with classifying and analysing user-generated text under defined polarities. Despite the upsurge of research in sentiment analysis in recent years, there is a dearth of literature on sentiment analysis applied to banks' social media data, particularly on African datasets. Against this background, this study applied a machine learning technique (support vector machine) to sentiment analysis of Nigerian banks' Twitter data over a 2-year period, from 1 January 2017 to 31 December 2018. After the data were crawled and preprocessed, the LibSVM algorithm in WEKA was used to build the sentiment classification model from the training data. The model's performance was evaluated on a pre-labelled test dataset generated from the five banks. The results show that the accuracy of the classifier was 71.8367%. The precision for both the positive and negative classes was above 0.7; the recall was 0.696 for the negative class and 0.741 for the positive class, which, together with other measures, shows that the prediction did better than chance. Applying the model to predict the sentiments of the five Nigerian banks' Twitter data reveals that the number of positive tweets in this period was slightly greater than the number of negative tweets. The scatter plots of the sentiment series indicated that the majority of the data fall between 0 and 100 sentiments per day, with few outliers above this range.
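
The accuracy, precision and recall figures quoted above all derive from a two-class confusion matrix. The sketch below shows the arithmetic with made-up counts; the numbers are hypothetical and do not reproduce the paper's test set, and `classification_report` is just an illustrative name.

```python
def classification_report(tp, fp, fn, tn):
    """Per-class metrics for a binary (positive/negative) sentiment classifier.

    tp: positive tweets predicted positive    fn: positive tweets predicted negative
    fp: negative tweets predicted positive    tn: negative tweets predicted negative
    """
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision_pos": tp / (tp + fp),
        "recall_pos": tp / (tp + fn),
        "precision_neg": tn / (tn + fn),
        "recall_neg": tn / (tn + fp),
    }

# Hypothetical confusion-matrix counts for a 200-tweet test set.
report = classification_report(tp=80, fp=30, fn=20, tn=70)
```

Reporting recall for each class separately, as the abstract does, guards against a classifier that looks accurate only because it favours the majority class.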

Citations: 7

Predicting the Stock Price Movement by Social Media Analysis

Journal of Data Analysis and Information Processing

Pub Date : 2019-09-12 DOI: 10.4236/jdaip.2019.74017

Sitong Chen, Tianhong Gao, Yuqinq He, Yifan Jin

Prediction of stock trends is an intriguing topic that has been studied extensively by researchers from diverse fields. Machine learning, a well-established family of algorithms, has also been studied for its potential in predicting financial markets. In this paper, seven data mining techniques are applied to predict the stock price movement of the Shanghai Composite Index: support vector machine, logistic regression, naive Bayes, K-nearest neighbor classification, decision tree, random forest and AdaBoost. Using the corresponding comments extracted between April 2017 and May 2018, the paper shows that: 1) sentiment derived from Eastmoney, a social media platform for the financial community in China, further enhances model performance; 2) for classifying positive and negative sentiment, all classifiers reach at least 75% accuracy, and the linear SVC models perform best; 3) given the strong correlation between the price fluctuation and the bullish index, the approximate overall trend of the closing price can be obtained.

Citations: 0

A Review of the Logistic Regression Model with Emphasis on Medical Research

Journal of Data Analysis and Information Processing

Pub Date : 2019-09-12 DOI: 10.4236/jdaip.2019.74012

E. Boateng, D. Abaye

This study explored and reviewed the logistic regression (LR) model, a multivariable method for modeling the relationship between multiple independent variables and a categorical dependent variable, with emphasis on medical research. Thirty-seven research articles published between 2000 and 2018 that employed logistic regression as the main statistical tool, as well as six textbooks on logistic regression, were reviewed. Logistic regression concepts such as odds, odds ratio, logit transformation, logistic curve, assumptions, selection of dependent and independent variables, model fitting, reporting and interpretation were presented. A perusal of the literature revealed considerable deficiencies in both the use and reporting of LR. For many studies, the ratio of the number of outcome events to predictor variables (events per variable) was small enough to call into question the accuracy of the regression model. Also, most studies did not report validation analysis, regression diagnostics or goodness-of-fit measures, which authenticate the robustness of the LR model. Here, we demonstrate a good example of the application of the LR model using data obtained on a cohort of pregnant women and the factors that influence their decision to opt for caesarean delivery or vaginal birth. It is recommended that researchers be more rigorous and pay greater attention to guidelines concerning the use and reporting of LR models.
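
The core quantities listed in the abstract (odds, logit, logistic curve, odds ratio, events per variable) are simple to compute, as the sketch below shows. The function names and example numbers are illustrative assumptions; in particular, `events_per_variable` uses the common convention of counting the rarer outcome class, which the abstract does not spell out.

```python
import math

def odds(p):
    """Odds of an outcome with probability p."""
    return p / (1 - p)

def logit(p):
    """Logit transformation: the natural log of the odds."""
    return math.log(odds(p))

def logistic(z):
    """Logistic curve: maps a linear predictor z back to a probability."""
    return 1 / (1 + math.exp(-z))

def odds_ratio(beta):
    """Odds ratio for a one-unit increase in a predictor with coefficient beta."""
    return math.exp(beta)

def events_per_variable(n_events, n_non_events, n_predictors):
    """Events per variable, counting the rarer outcome class (assumed convention)."""
    return min(n_events, n_non_events) / n_predictors

# Hypothetical cohort: 40 caesarean deliveries, 160 vaginal births, 8 predictors.
epv = events_per_variable(40, 160, 8)
# A hypothetical fitted coefficient of ln(2) doubles the odds of the outcome.
doubling = odds_ratio(math.log(2))
```

A low events-per-variable figure signals that the model has too many predictors for the number of outcome events, which is the deficiency the review highlights.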

Citations: 109