Pub Date: 2020-01-01. DOI: 10.4236/jdaip.2020.83008
Imran Chowdhury Dipto, A. F. M. Moshiur Rahman, Tanzila Islam, H. Rahman
Organizations generate large amounts of data, and data scientists use a range of analytical tools to handle it. Many tools are available for data processing, visualization, predictive analytics, and related tasks, so selecting a suitable analytics tool or programming language matters. This research compares and contrasts two of the most commonly used programming languages, Python and R. For the experiment, two data sets were collected from Kaggle and combined into a single dataset. The study visualizes the data to generate useful insights and prepares it for training an artificial neural network in both Python and R; the scope of the paper is to compare the analytical capabilities of the two languages. An artificial neural network with a multilayer perceptron was implemented to predict the severity of accidents. The results are then used to assess which programming language is better suited to data visualization, data processing, predictive analytics, and related work.
Title: Prediction of Accident Severity Using Artificial Neural Network: A Comparison of Analytical Capabilities between Python and R
Journal: Journal of Data Analysis and Information Processing
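As a rough illustration of the kind of model the abstract describes, here is a minimal multilayer-perceptron severity classifier in Python with scikit-learn. The features and labels below are synthetic stand-ins; the paper's actual Kaggle data, network architecture, and hyper-parameters are not specified in the abstract.

```python
# Minimal MLP severity-classification sketch (synthetic data, assumed setup).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))                  # stand-ins for speed, weather, light, ...
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # 0 = slight, 1 = severe (invented rule)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_tr)            # scale features before the MLP

clf = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=2000, random_state=0)
clf.fit(scaler.transform(X_tr), y_tr)
accuracy = clf.score(scaler.transform(X_te), y_te)
```

An equivalent R sketch would use, for example, the `nnet` or `keras` packages; the comparison of the two ecosystems is the paper's subject.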
Pub Date: 2020-01-01. DOI: 10.4236/jdaip.2020.84014
Samuel Ambapour
The following analysis is based on a multidimensional understanding of poverty using a nonmonetary, basic-needs approach. It is grounded in data from the first survey on household living conditions for poverty assessment, conducted by the National Institute of Statistics of Congo in 2005. Multiple Correspondence Analysis is applied to construct a composite indicator by aggregating several attributes likely to reflect the poverty of individuals or households. The application shows that Congolese households are not all affected by the same type of poverty: three types of non-monetary poverty are identified, namely infrastructure poverty, vulnerability of human existence, and poverty of comfort. Households were then classified according to the composite indicator of poverty. The results show that the incidence of poverty corresponds to the weight of the poor class, about 70.67%.
Title: Using Multiple Correspondence Analysis to Measure Multidimensional Poverty in Congo
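A rough sketch of the aggregation step: one-hot encoding the categorical deprivation attributes and taking an SVD of the centred indicator matrix approximates the first MCA axis (full MCA adds chi-square row/column weighting, omitted here). The eight households and three attributes are invented for illustration.

```python
# Approximate first-MCA-axis composite indicator (assumed toy data).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "water":       ["no", "yes", "no", "no", "yes", "no", "yes", "no"],
    "electricity": ["no", "yes", "no", "yes", "yes", "no", "yes", "no"],
    "roof":        ["mud", "tin", "mud", "tin", "tin", "mud", "tin", "mud"],
})

Z = pd.get_dummies(df).to_numpy(float)   # complete disjunctive (indicator) table
Zc = Z - Z.mean(axis=0)                  # centre each indicator column
U, s, Vt = np.linalg.svd(Zc, full_matrices=False)
composite = U[:, 0] * s[0]               # first-axis score per household
```

Households with identical attribute profiles receive identical scores, and the scores can then be thresholded to classify households into poverty classes, as the paper does with its composite indicator.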
Pub Date: 2020-01-01. DOI: 10.4236/jdaip.2020.83009
D. Adedia, Livingstone Asem, S. Appiah, S. Nanga, Y. Boateng, K. Duedu, L. Anani
Globally, hypertension is one of the leading causes of death. It can lead to heart disease and stroke, among other conditions, and may result in premature death. In Ghana, hypertension is considered a disease that contributes to an increase in outpatient attendance. To assess the trend differentials of hypertension-induced deaths in Ghana, the chi-square test for equal proportions and the Marascuilo procedure for pairwise comparison were applied to surveillance data on reported deaths from 2012 to 2016 across the then ten regions. The results show that the incidence of hypertension-induced mortality differed significantly across almost all regions and over the years, and that it fell significantly from 2012 to 2016. However, the Volta Region recorded a higher incidence of mortality (p-value less than 0.0001) than the other regions during the period under review, while the Upper East Region recorded a continuous increase in incidence, peaking in 2016. The Eastern, Central, and Greater Accra Regions recorded significantly higher incidence (p-value less than 0.0001) than the Ashanti, Brong Ahafo, Northern, Western, and Upper West Regions; the Upper West and Western Regions had the lowest incidence. The decline in hypertension-induced mortality could be attributed to healthcare interventions introduced during the period, one of which was the introduction of health insurance in 2003, a development shown to affect people's health-seeking behavior. It is therefore important to investigate the factors driving these spatial and temporal dynamics in order to determine appropriate ways to control hypertension-induced deaths in the country. Public health education should also be intensified to curb hypertension and its attendant risks.
Title: Assessment of Hypertension-Induced Deaths in Ghana: A Nation-Wide Study from 2012 to 2016
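The two tests named in the abstract can be sketched as follows: a chi-square test for equal proportions, then the Marascuilo procedure to identify which pairs of regions differ. The regional death counts below are made up for illustration, not the Ghanaian surveillance data.

```python
# Chi-square test for equal proportions + Marascuilo pairwise procedure
# (assumed toy counts; equal denominators keep the sketch simple).
import numpy as np
from scipy.stats import chi2, chisquare

deaths = np.array([120, 95, 150, 80])        # hypothetical deaths per region
totals = np.array([1000, 1000, 1000, 1000])  # equal populations for simplicity
p = deaths / totals

# With equal totals, testing equal proportions reduces to testing equal counts.
chi_stat, p_value = chisquare(deaths, f_exp=[deaths.mean()] * len(deaths))

# Marascuilo: pair (i, j) differs if |p_i - p_j| exceeds its critical range.
crit = chi2.ppf(0.95, df=len(p) - 1)
pairs = []
for i in range(len(p)):
    for j in range(i + 1, len(p)):
        rng_ij = np.sqrt(crit) * np.sqrt(
            p[i] * (1 - p[i]) / totals[i] + p[j] * (1 - p[j]) / totals[j])
        pairs.append((i, j, abs(p[i] - p[j]) > rng_ij))
```

The per-pair flags play the same role as the paper's pairwise regional comparisons: after the global test rejects equality, they localise which regions drive the difference.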
Pub Date: 2020-01-01. DOI: 10.4236/JDAIP.2020.84013
Ruming Li
Graphical representation of hierarchical clustering results is of central importance in hierarchical cluster analysis of data. Unfortunately, most mathematical or statistical software is weak at showcasing such results; in particular, most clustering trees cannot be drawn as dendrograms in a resizable, rescalable, free-style fashion. By drawing "dynamically" instead of "statically", this research works around the limitations that restrict arbitrary visualization of clustering results. It introduces an algorithmic solution that uses seamless pixel rearrangements to resize and rescale dendrograms or tree diagrams. The results show that the algorithm makes clustering-outcome representation a genuinely free visualization for hierarchical clustering and bioinformatics analysis. In particular, it can selectively visualize and/or save results at a specific size, scale, and style (different views).
Title: Resizable, Rescalable and Free-Style Visualization of Hierarchical Clustering and Bioinformatics Analysis
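The paper's pixel-rearrangement algorithm is not public, but the general idea of decoupling tree geometry from any fixed-size rendering can be sketched with SciPy: compute the dendrogram coordinates once, then draw them at whatever size or scale the caller wants. The data here are synthetic.

```python
# Dendrogram geometry decoupled from rendering (SciPy stand-in, not the
# paper's algorithm; observations are synthetic).
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 4))      # ten synthetic observations
Z = linkage(X, method="average")  # agglomerative clustering

# no_plot=True returns only the tree coordinates, with no figure attached;
# a caller can replot them at any figure size, axis scale, or style.
tree = dendrogram(Z, no_plot=True)
leaves = tree["leaves"]           # leaf order, left to right
heights = tree["dcoord"]          # merge heights, one entry per U-link
```

Re-rendering then amounts to replotting `icoord`/`dcoord` on freshly sized axes, which is the "resizable, rescalable" behaviour the paper implements at the pixel level.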
Pub Date: 2020-01-01. DOI: 10.4236/jdaip.2020.81001
Yanyan Chen, Jiajie Ma, Shaohua Wang
Pedestrian safety has recently been considered one of the most serious issues in traffic-safety research. This study analyzes the spatial correlation between the frequency of pedestrian crashes and various predictor variables based on open-source point-of-interest (POI) data, which capture specific land-use features and user characteristics. Spatial regression models were developed at the Traffic Analysis Zone (TAZ) level using 10,333 pedestrian crash records within the Fifth Ring of Beijing in 2015. Several spatial-econometrics approaches were used to examine spatial autocorrelation in the crash count per TAZ, and spatial heterogeneity was investigated with a geographically weighted regression model. The results showed that the spatial error model outperformed the other two spatial models and a traditional ordinary least squares model. Specifically, bus stops, hospitals, pharmacies, restaurants, and office buildings had positive impacts on pedestrian crashes, while hotels were negatively associated with crash occurrence. In addition, significant localization effects were found for the different POI types. Based on these findings, recommendations and countermeasures can be proposed to improve traffic safety for pedestrians.
Title: Spatial Regression Analysis of Pedestrian Crashes Based on Point-of-Interest Data
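The spatial-autocorrelation check that motivates moving from OLS to spatial models can be sketched with Moran's I on a toy one-dimensional chain of zones (illustrative only; the paper works with TAZ polygons and dedicated spatial-econometrics estimators).

```python
# Moran's I on toy zone crash counts with a row-standardised contiguity
# weight matrix (assumed chain adjacency, not the Beijing TAZ geometry).
import numpy as np

counts = np.array([5.0, 6.0, 5.0, 1.0, 0.0, 1.0])  # crashes per zone
n = len(counts)

W = np.zeros((n, n))                   # zones i and i+/-1 are neighbours
for i in range(n):
    for j in (i - 1, i + 1):
        if 0 <= j < n:
            W[i, j] = 1.0
W = W / W.sum(axis=1, keepdims=True)   # row-standardise

z = counts - counts.mean()
moran_I = (n / W.sum()) * (z @ W @ z) / (z @ z)  # > 0 => spatial clustering
```

A clearly positive Moran's I, as here, is the signal that nearby zones have similar counts, so an OLS model's independence assumption fails and a spatial lag or spatial error specification is preferable.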
Pub Date: 2019-10-15. DOI: 10.4236/jdaip.2019.74013
O. Faggioni
We show a quantitative technique, characterized by low numerical mediation, for the reconstruction of temporal sequences of geophysical data of length L interrupted for a time ΔT. The aim is to protect the information acquired before and after the interruption by means of a numerical protocol with the lowest possible computational weight. The signal reconstruction is based on synthesizing the low-frequency signal extracted by subsampling (subsampling ∇Dirac = ΔT, in phase with ΔT) with the high-frequency signal recorded before the crash. The SYRec (SYnthetic REConstruction) method, for its simplicity, speed of calculation, and stability of spectral response, is particularly effective in studies of high-speed transient phenomena that develop in strongly perturbed fields. This operative condition is fundamental when almost immediate informational responses are required of the observation system. In this example we deal with geomagnetic data from an underwater counter-intrusion magnetic system. The system produces timely information about the transit of local magnetic singularities (magnetic perturbations of low spatial extension), originated by quasi-point-form kinematic sources (divers), in harbor underwater magnetic fields. The stability of the SYRec system also makes it usable over long and medium observation periods (the activity of geomagnetic observatories).
Title: The Information Protection in Automatic Reconstruction of Not Continuous Geophysical Data Series
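The synthesis step the abstract describes — low-frequency trend plus the high-frequency signal recorded before the crash — can be sketched on a toy series. The signal, gap, and subsampling interval below are invented, and linear interpolation of a ΔT-spaced subsample stands in for the paper's exact low-frequency extraction.

```python
# Toy SYRec-style gap fill: low-frequency trend from a ΔT-spaced subsample
# plus the pre-gap high-frequency residual (assumed toy signal and gap).
import numpy as np

t = np.arange(200)
signal = np.sin(2 * np.pi * t / 400) + 0.1 * np.sin(2 * np.pi * t / 5)
broken = signal.copy()
gap = slice(120, 140)                      # 20 samples lost for a time ΔT
broken[gap] = np.nan

coarse = t[::20]                           # ΔT-spaced subsample
nodes = coarse[~np.isnan(broken[coarse])]  # drop subsample nodes inside the gap
lo = np.interp(t, nodes, broken[nodes])    # low-frequency component

pre = broken[100:120] - lo[100:120]        # high-frequency residual before the gap
reconstructed = broken.copy()
reconstructed[gap] = lo[gap] + pre         # synthesis: trend + replayed residual

err = np.max(np.abs(reconstructed[gap] - signal[gap]))
```

Because the high-frequency component is quasi-periodic and in phase with ΔT, replaying the pre-gap residual on top of the interpolated trend recovers the lost samples closely, which is the informational-protection property the paper emphasises.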
Pub Date: 2019-09-30. DOI: 10.4236/jdaip.2019.74011
Beibei Cao
With the deep integration of the Internet world and the real world, Internet information not only provides real-time, useful data for financial investors but also helps them understand market dynamics and quickly identify financial events that may trigger stock-market volatility. However, research on event detection in the financial field has focused largely on micro-blogs, news, and other network text; few scholars have studied the characteristics of financial time-series data. Considering that, in finance, an event often affects both the online public-opinion space and the real transaction space, this paper proposes a multi-source heterogeneous information-detection method that combines stock-transaction time-series data with online public-opinion text to detect hot events in the stock market. The method uses an outlier-detection algorithm based on multi-member fusion to extract the time of hot events, and applies the feature-item weight formula proposed in this paper to compute keyword weights in the public-opinion text and recover the core content of the events. Together these steps yield accurate detection of stock-market hot events.
Title: Hot Events Detection of Stock Market Based on Time Series Data of Stock and Text Data of Network Public Opinion
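The first stage — locating candidate event times as outliers in the transaction series — can be sketched with a simple z-score rule on log returns. This is a generic stand-in for the paper's multi-member-fusion detector, and the price series is invented.

```python
# Flag candidate "hot event" days as return outliers (assumed toy prices;
# a z-score rule stands in for the paper's multi-member fusion).
import numpy as np

prices = np.array([10.0, 10.1, 10.05, 10.2, 12.5, 12.4, 12.6, 12.5])
log_returns = np.diff(np.log(prices))

z = (log_returns - log_returns.mean()) / log_returns.std()
event_days = np.where(np.abs(z) > 2)[0] + 1   # +1 maps back to price index
```

The flagged day (the large jump from 10.2 to 12.5) is then the window in which the second stage would weight keywords in public-opinion text to recover what the event was about.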
Pub Date: 2019-09-23. DOI: 10.4236/jdaip.2019.74010
F. C. Onwuegbuche, J. Wafula, J. Mung'atu
The rise of social media paves the way for unprecedented benefits or risks to organisations, depending on how they adapt to its changes. It also brings the challenge of extracting insights from big data for effective and efficient decision-making that can improve quality, profitability, productivity, competitiveness, and customer satisfaction. Sentiment analysis is the field concerned with classifying and analysing user-generated text under defined polarities. Despite the recent upsurge of research in sentiment analysis, there is little literature on sentiment analysis applied to banks' social-media data, and even less on African datasets. Against this background, this study applied a machine-learning technique (support vector machine) to sentiment analysis of Nigerian banks' Twitter data over a two-year period, from 1st January 2017 to 31st December 2018. After crawling and preprocessing the data, the LibSVM algorithm in WEKA was used to build the sentiment-classification model from the training data, and its performance was evaluated on a pre-labelled test dataset drawn from the five banks. The classifier achieved an accuracy of 71.8367%. Precision was above 0.7 for both classes, while recall was 0.696 for the negative class and 0.741 for the positive class, indicating that, among other measures, the predictions did better than chance. Applying the model to the five banks' tweets shows that positive tweets slightly outnumbered negative tweets over the period, and scatter plots of the sentiment series indicate that most days fall between 0 and 100 sentiments per day, with a few outliers above this range.
Title: Support Vector Machine for Sentiment Analysis of Nigerian Banks Financial Tweets
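The pipeline the abstract describes — text features into an SVM polarity classifier — can be sketched in scikit-learn (the paper itself used LibSVM inside WEKA). The four tweets and their labels are invented stand-ins for the crawled bank tweets.

```python
# Bag-of-words SVM sentiment sketch (assumed toy tweets; scikit-learn
# LinearSVC stands in for WEKA's LibSVM).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

tweets = [
    "great service from the bank today", "love the new mobile app",
    "terrible experience, long queues",  "app keeps failing, very poor",
]
labels = ["pos", "pos", "neg", "neg"]

# TF-IDF features feed a linear support vector classifier
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(tweets, labels)
pred = model.predict(["the app is great"])[0]
train_acc = model.score(tweets, labels)
```

In the study this classifier is trained on labelled tweets, evaluated on a held-out labelled test set (the reported 71.8367% accuracy), and then applied to score the full two-year tweet stream.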
Pub Date: 2019-09-12. DOI: 10.4236/jdaip.2019.74017
Sitong Chen, Tianhong Gao, Yuqinq He, Yifan Jin
Predicting stock trends has long been an intriguing topic, studied extensively by researchers from diverse fields, and machine learning, a well-established family of algorithms, has also been examined for its potential in financial-market prediction. In this paper, seven data-mining techniques are applied to predict the stock-price movement of the Shanghai Composite Index: support vector machine, logistic regression, naive Bayes, k-nearest-neighbour classification, decision tree, random forest, and AdaBoost. Using comments extracted between April 2017 and May 2018, the study shows that: 1) sentiment derived from Eastmoney, a social-media platform for the financial community in China, further improves model performance; 2) for positive and negative sentiment classification, all classifiers reach at least 75% accuracy, with the linear SVC models performing best; and 3) given the strong correlation between price fluctuation and the bullish index, the approximate overall trend of the closing price can be recovered.
Title: Predicting the Stock Price Movement by Social Media Analysis
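Benchmarking the seven model families on one feature matrix can be sketched as below. The features and up/down labels are synthetic; the paper's actual inputs are Eastmoney sentiment features, and its hyper-parameters are not given in the abstract.

```python
# Cross-validated comparison of the seven classifier families named in the
# abstract (assumed synthetic data, default hyper-parameters).
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))             # stand-in sentiment/price features
y = (X[:, 0] - X[:, 2] > 0).astype(int)   # invented up/down label

models = {
    "svm": SVC(), "logreg": LogisticRegression(),
    "nb": GaussianNB(), "knn": KNeighborsClassifier(),
    "tree": DecisionTreeClassifier(random_state=0),
    "rf": RandomForestClassifier(random_state=0),
    "ada": AdaBoostClassifier(random_state=0),
}
scores = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}
```

Ranking `scores` reproduces the shape of the paper's comparison, where the linear SVC variants came out on top.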
Pub Date : 2019-09-12 DOI: 10.4236/jdaip.2019.74012
E. Boateng, D. Abaye
This study explored and reviewed the logistic regression (LR) model, a multivariable method for modeling the relationship between multiple independent variables and a categorical dependent variable, with emphasis on medical research. Thirty-seven research articles published between 2000 and 2018 that employed logistic regression as the main statistical tool, as well as six textbooks on logistic regression, were reviewed. Logistic regression concepts such as odds, the odds ratio, the logit transformation, the logistic curve, model assumptions, selection of dependent and independent variables, model fitting, reporting, and interpretation are presented. On perusing the literature, considerable deficiencies were found in both the use and the reporting of LR. In many studies, the ratio of the number of outcome events to predictor variables (events per variable) was small enough to call the accuracy of the regression model into question. Most studies also did not report validation analyses, regression diagnostics, or goodness-of-fit measures, which authenticate the robustness of the LR model. Here, we demonstrate a sound application of the LR model using data obtained on a cohort of pregnant women and the factors that influence their decision to opt for caesarean delivery or vaginal birth. It is recommended that researchers be more rigorous and pay greater attention to guidelines on the use and reporting of LR models.
{"title":"A Review of the Logistic Regression Model with Emphasis on Medical Research","authors":"E. Boateng, D. Abaye","doi":"10.4236/jdaip.2019.74012","DOIUrl":"https://doi.org/10.4236/jdaip.2019.74012","url":null,"abstract":"This study explored and reviewed the logistic regression (LR) model, a multivariable method for modeling the relationship between multiple independent variables and a categorical dependent variable, with emphasis on medical research. Thirty seven research articles published between 2000 and 2018 which employed logistic regression as the main statistical tool as well as six text books on logistic regression were reviewed. Logistic regression concepts such as odds, odds ratio, logit transformation, logistic curve, assumption, selecting dependent and independent variables, model fitting, reporting and interpreting were presented. Upon perusing the literature, considerable deficiencies were found in both the use and reporting of LR. For many studies, the ratio of the number of outcome events to predictor variables (events per variable) was sufficiently small to call into question the accuracy of the regression model. Also, most studies did not report on validation analysis, regression diagnostics or goodness-of-fit measures; measures which authenticate the robustness of the LR model. Here, we demonstrate a good example of the application of the LR model using data obtained on a cohort of pregnant women and the factors that influence their decision to opt for caesarean delivery or vaginal birth. 
It is recommended that researchers should be more rigorous and pay greater attention to guidelines concerning the use and reporting of LR models.","PeriodicalId":71434,"journal":{"name":"数据分析和信息处理(英文)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48024591","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
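The odds-ratio interpretation this review centers on follows from exponentiating the fitted LR coefficients. A minimal sketch, using simulated data in place of the authors' caesarean-delivery cohort (the variable names and effect sizes below are hypothetical), might look like:

```python
# Fit a logistic regression and read off odds ratios as exp(coefficient).
# Data are simulated; variables stand in for a hypothetical obstetric cohort.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 500
age = rng.normal(30, 5, n)         # maternal age in years (hypothetical)
prior_cs = rng.integers(0, 2, n)   # previous caesarean, 0/1 (hypothetical)

# Simulated ground truth: odds of caesarean rise with age and a prior caesarean.
true_logit = -8 + 0.2 * age + 1.5 * prior_cs
p = 1 / (1 + np.exp(-true_logit))
caesarean = rng.binomial(1, p)

X = np.column_stack([age, prior_cs])
model = LogisticRegression().fit(X, caesarean)

# exp(coefficient) is the odds ratio per unit increase in that predictor.
for name, coef in zip(["age", "prior_cs"], model.coef_[0]):
    print(f"{name}: OR = {np.exp(coef):.2f}")
```

With roughly 250 events among 500 observations and two predictors, this toy example comfortably satisfies the events-per-variable guideline the review discusses; the deficiency it flags arises when that ratio drops into the single digits.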