Unsupervised Feature Selection Methodology for Clustering in High Dimensionality Datasets
Marcos de Souza Oliveira, Sergio Queiroz
Pub Date: 2020-04-27 | DOI: 10.22456/2175-2745.96081
Feature selection is an important research area that seeks to eliminate unwanted features from datasets. Many feature selection methods have been suggested in the literature, but the evaluation of the best set of features is usually performed using supervised metrics, which require labels. In this work we propose a methodology that aims to help data specialists answer simple but important questions, such as: (1) do current feature selection methods give similar results? (2) is there a consistently better method? (3) how to select the m-best features? (4) as the methods are not parameter-free, how to choose the best parameters in the unsupervised scenario? and (5) given different selection options, could we get better results by fusing the results of the methods, and if so, how can the results be combined? We analyze these issues and propose a methodology that, based on some unsupervised methods, performs feature selection on high-dimensional datasets using strategies that make the process fully automatic and unsupervised. We then evaluate the obtained results and find that they are better than those obtained by using the selection methods at their standard configurations. Finally, we list some further improvements that can be made in future work.
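To make question (5) concrete, the sketch below fuses two simple unsupervised rankers by Borda-count rank aggregation. The rankers (variance and mean absolute deviation) and the fusion rule are illustrative assumptions, not the methods evaluated in the paper.

```python
import numpy as np

def borda_fusion(rankings):
    """Fuse several best-first feature rankings by Borda count."""
    n = len(rankings[0])
    scores = np.zeros(n)
    for r in rankings:
        scores[r] += np.arange(n, 0, -1)  # 1st place gets n points, last gets 1
    return np.argsort(-scores)            # fused ranking, best-first

def select_m_best(X, m):
    # Two simple unsupervised rankers (illustrative stand-ins): variance
    # and mean absolute deviation, both computed without labels.
    var_rank = np.argsort(-X.var(axis=0))
    mad_rank = np.argsort(-np.abs(X - X.mean(axis=0)).mean(axis=0))
    return borda_fusion([var_rank, mad_rank])[:m]

X = np.random.rand(100, 50)    # 100 samples, 50 features
print(select_m_best(X, m=10))  # indices of the 10 fused-best features
```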
{"title":"Unsupervised Feature Selection Methodology for Clustering in High Dimensionality Datasets","authors":"Marcos de Souza Oliveira, Sergio Queiroz","doi":"10.22456/2175-2745.96081","DOIUrl":"https://doi.org/10.22456/2175-2745.96081","url":null,"abstract":"Feature selection is an important research area that seeks to eliminate unwanted features from datasets. Many feature selection methods are suggested in the literature, but the evaluation of the best set of features is usually performed using supervised metrics, where labels are required. In this work we propose a methodology that tries to aid data specialists to answer simple but important questions, such as: (1) do current feature selection methods give similar results? (2) is there is a consistently better method ? (3) how to select the m -best features? (4) as the methods are not parameter-free, how to choose the best parameters in the unsupervised scenario? and (5) given different options of selection, could we get better results if we fusion the results of the methods? If yes, how can we combine the results? We analyze these issues and propose a methodology that, based on some unsupervised methods, will make feature selection using strategies that turn the execution of the process fully automatic and unsupervised, in high-dimensional datasets. After, we evaluate the obtained results, when we see that they are better than those obtained by using the selection methods at standard configurations. In the end, we also list some further improvements that can be made in future works.","PeriodicalId":82472,"journal":{"name":"Research initiative, treatment action : RITA","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2020-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88187662","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scrumie: Scrum Teaching Agent Oriented Game
L. Marinho, Suelen Regina C. dos Santos, Leonardo Andrade, Bruna Costa Cons, Marcelo Schots, Vera Werneck
Pub Date: 2020-04-27 | DOI: 10.22456/2175-2745.98203
The use of agile methods has become essential in present-day software development. Among the existing methods, Scrum is one of the most prominent and is used to manage projects in companies even outside the scope of software systems development. Considering the relevance of this subject and the success usually obtained in learning through educational games, Scrumie was proposed to teach the management of Scrum projects. Scrumie applies intelligence through a multi-agent architecture developed with Agile PASSI, an agent-oriented methodology. This paper presents the proposal, modeling, implementation, and evaluation of the Scrumie game.
{"title":"Scrumie: Scrum Teaching Agent Oriented Game","authors":"L. Marinho, Suelen Regina C. dos Santos, Leonardo Andrade, Bruna Costa Cons, Marcelo Schots, Vera Werneck","doi":"10.22456/2175-2745.98203","DOIUrl":"https://doi.org/10.22456/2175-2745.98203","url":null,"abstract":"The use of agile methods has become essential in software development at the present time. Among the existing methods, Scrum is one of the major ones, and is used to manage projects in companies, even outside of the scope of software systems development. Considering the relevance of this subject and the success usually obtained in learning through educational games, Scrumie was proposed to teach the management of Scrum projects. Scrumie applies intelligence in multiagent architecture being developed with Agile Passi an agent oriented methodology. This paper contains a proposal, modeling, implementation and evaluation of the Scrumie game.","PeriodicalId":82472,"journal":{"name":"Research initiative, treatment action : RITA","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2020-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75066102","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Effects of reward distribution strategies and perseverance profiles on agent-based coalitions dynamics
Luis Gustavo Ludescher, Jaime Simão Sichman
Pub Date: 2020-04-27 | DOI: 10.22456/2175-2745.94845
In a conventional political system, leaders decide how to distribute benefits to the population, and coalitions can emerge when other individuals support the candidates. This work analyzes how different leader strategies and individual profiles affect the way coalitions are formed and rewards are shared. Using agent-based simulation, we built a model in which individuals with three different perseverance profiles (patient, intermediate, and impatient) eventually decide to join coalitions by supporting certain leaders, aiming to maximize their own earnings. Leaders can follow one of three strategies to share rewards: altruistic, intermediate, and egoistic. The results show that egoistic leaders stimulate competition for rewards and the formation of coalitions, causing greater inequality, while impatient individuals promote more instability and lead to a higher concentration of rewards.
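The toy sketch below only illustrates the general shape of such a simulation; the reward-sharing fractions, patience thresholds, and leader-switching rule are invented for illustration and do not reproduce the paper's model.

```python
import random

# Invented parameters: fraction of each round's reward the leader keeps.
LEADER_KEEP = {"altruistic": 0.2, "intermediate": 0.5, "egoistic": 0.8}
# Invented thresholds: unrewarded rounds tolerated before switching leader.
PATIENCE = {"patient": 10, "intermediate": 5, "impatient": 2}

class Individual:
    def __init__(self, profile, leaders):
        self.profile, self.leaders = profile, leaders
        self.leader = random.choice(leaders)
        self.earnings, self.dry_rounds = 0.0, 0

    def step(self, payout):
        self.earnings += payout
        self.dry_rounds = 0 if payout > 0 else self.dry_rounds + 1
        if self.dry_rounds >= PATIENCE[self.profile]:  # lose patience, switch
            self.leader = random.choice(self.leaders)
            self.dry_rounds = 0

def simulate(rounds=100, reward=100.0):
    leaders = list(LEADER_KEEP)
    people = [Individual(p, leaders) for p in PATIENCE for _ in range(20)]
    for _ in range(rounds):
        for leader in leaders:
            members = [i for i in people if i.leader == leader]
            shared = reward * (1 - LEADER_KEEP[leader])
            # Only a random half of the followers is paid each round, so
            # patience matters (an arbitrary modeling choice for the demo).
            lucky = set(random.sample(members, len(members) // 2))
            for i in members:
                i.step(shared / len(lucky) if i in lucky else 0.0)
    return people

people = simulate()
for prof in PATIENCE:
    print(prof, round(sum(i.earnings for i in people if i.profile == prof), 1))
```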
{"title":"Effects of reward distribution strategies and perseverance profiles on agent-based coalitions dynamics","authors":"Luis Gustavo Ludescher, Jaime Simão Sichman","doi":"10.22456/2175-2745.94845","DOIUrl":"https://doi.org/10.22456/2175-2745.94845","url":null,"abstract":"In a conventional political system, leaders decide how to distribute benefits to the population and coalitions can emerge when other individuals support the candidates. This work intends to analyze how different leader strategies and individual profiles affect the way coalitions are formed and rewards are shared. Using agent-based simulation, we simulated a model in which individuals of three different perseverance profiles (patient, intermediate and impatient) eventually decide to be part of coalitions by supporting certain leaders when aiming to maximize their own earnings. Leaders can follow one of three different strategies to share rewards: altruistic, intermediate and egoistic. The results show that egoistic leaders stimulate the competition for rewards and the formation of coalitions, causing greater inequalities, while impatient individuals also promote more instability and lead to a higher concentration of rewards.","PeriodicalId":82472,"journal":{"name":"Research initiative, treatment action : RITA","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2020-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87283172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Prize Collecting Traveling Salesman Problem with Ridesharing
Ygor Alcântara de Medeiros, M. Goldbarg, E. Goldbarg
Pub Date: 2020-04-27 | DOI: 10.22456/2175-2745.94082
The Prize Collecting Traveling Salesman Problem with Ridesharing is a model that joins elements of the Prize Collecting Traveling Salesman Problem and collaborative transport. The salesman is the driver of a capacitated vehicle and uses a ridesharing system to minimize travel costs. A penalty and a bonus are associated with each vertex of a graph G that represents the problem, and a cost is associated with each edge of G. The salesman must choose a subset of vertices to visit so that the total bonus collected is at least a given parameter, while the length of the tour plus the sum of the penalties of all unvisited vertices is as small as possible. There is a set of persons demanding rides. Each ride request consists of pickup and drop-off locations, a maximum travel duration, and the maximum amount the person agrees to pay. The driver shares the cost of each arc in the tour with the passengers in the vehicle. Constraints from the ride requests, as well as the capacity of the car, must be satisfied. We present a mathematical formulation for the problem investigated in this study and solve it with an optimization tool. We also present three heuristics that hybridize exact and heuristic methods. These algorithms use a decomposition strategy that other enriched vehicle routing problems can utilize.
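For reference, the prize-collecting core of the objective described above can be written as follows; the notation is ours, and the paper's full formulation additionally carries the ridesharing and vehicle-capacity constraints.

```latex
\min \sum_{(i,j)\in E} c_{ij}\, x_{ij} \;+\; \sum_{i\in V} \pi_i\,(1 - y_i)
\qquad \text{s.t.} \qquad \sum_{i\in V} b_i\, y_i \ge B
```

Here \(x_{ij}=1\) if edge \((i,j)\) belongs to the tour, \(y_i=1\) if vertex \(i\) is visited, \(c_{ij}\) is the edge cost, \(\pi_i\) the penalty, \(b_i\) the bonus, and \(B\) the required total bonus.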
{"title":"Prize Collecting Traveling Salesman Problem with Ridesharing","authors":"Ygor Alcântara de Medeiros, M. Goldbarg, E. Goldbarg","doi":"10.22456/2175-2745.94082","DOIUrl":"https://doi.org/10.22456/2175-2745.94082","url":null,"abstract":"The Prize Collecting Traveling Salesman Problem with Ridesharing is a model that joins elements from the Prize Collecting Traveling Salesman and the collaborative transport. The salesman is the driver of a capacitated vehicle and uses a ridesharing system to minimize travel costs. There are a penalty and a bonus associated with each vertex of a graph, G, that represents the problem. There is also a cost associated with each edge of G. The salesman must choose a subset of vertices to be visited so that the total bonus collection is at least a given a parameter. The length of the tour plus the sum of penalties of all vertices not visited is as small as possible. There is a set of persons demanding rides. The ride request consists of a pickup and a drop off location, a maximum travel duration, and the maximum amount the person agrees to pay. The driver shares the cost associated with each arc in the tour with the passengers in the vehicle. Constraints from ride requests, as well as the capacity of the car, must be satisfied. We present a mathematical formulation for the problem investigated in this study and solve it in an optimization tool. We also present three heuristics that hybridize exact and heuristic methods. These algorithms use a decomposition strategy that other enriched vehicle routing problems can utilize.","PeriodicalId":82472,"journal":{"name":"Research initiative, treatment action : RITA","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2020-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91265601","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Studying the Performance of Cognitive Models in Time Series Forecasting
A. B. S. Neto, T. Ferreira, M. D. C. M. Batista, P. Firmino
Pub Date: 2020-01-15 | DOI: 10.22456/2175-2745.96181
Cognitive models have been paramount for modeling phenomena for which empirical data are unavailable, scarce, or only partially relevant. These approaches are based on methods dedicated to preparing experts and then eliciting their opinions about the variables that describe the phenomena under study. In time series forecasting exercises, elicitation processes seek to obtain accurate estimates that overcome human heuristic biases while being less time-consuming. This paper compares the accuracy of cognitive and mathematical time series predictors. The results are based on comparing the predictions of the cognitive and mathematical models on several time series from the M3-Competition, and they show that cognitive models are at least as accurate as ARIMA model predictions.
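As a sketch of the mathematical side of such a comparison, a minimal ARIMA baseline scored with sMAPE (the headline M3-Competition metric) might look like the following; the series, the (1, 1, 1) order, and the horizon are placeholders rather than the paper's setup.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def smape(actual, forecast):
    """Symmetric MAPE, the standard accuracy metric of the M3-Competition."""
    actual, forecast = np.asarray(actual), np.asarray(forecast)
    return 200.0 * np.mean(np.abs(forecast - actual)
                           / (np.abs(actual) + np.abs(forecast)))

# Toy random-walk series standing in for an M3 series.
series = np.cumsum(np.random.randn(120)) + 50
train, test = series[:-18], series[-18:]   # 18 steps: the M3 monthly horizon

model = ARIMA(train, order=(1, 1, 1)).fit()
forecast = model.forecast(steps=len(test))
print(f"sMAPE: {smape(test, forecast):.2f}%")
```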
{"title":"Studying the Performance of Cognitive Models in Time Series Forecasting","authors":"A. B. S. Neto, T. Ferreira, M. D. C. M. Batista, P. Firmino","doi":"10.22456/2175-2745.96181","DOIUrl":"https://doi.org/10.22456/2175-2745.96181","url":null,"abstract":"Cognitive models have been paramount for modeling phenomena for which empirical data are unavailable, scarce, or only partially relevant. These approaches are based on methods dedicated to preparing experts and then to elicit their opinions about the variables that describe the phenomena under study. In time series forecasting exercises, elicitation processes seek to obtain accurate estimates, overcoming human heuristic biases, while being less time consuming. This paper aims to compare the performance of cognitive and mathematical time series predictors, regarding accuracy. The results are based on the comparison of predictors of the cognitive and mathematical models for several time series from the M3-Competition. From the results, one can see that cognitive models are, at least, as accurate as ARIMA models predictions.","PeriodicalId":82472,"journal":{"name":"Research initiative, treatment action : RITA","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2020-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75978653","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Computational Strategy for Classification of Enem Issues Based on Item Response Theory
G. H. Nunes, B. A. Oliveira, C. Nametala
Pub Date: 2020-01-15 | DOI: 10.22456/2175-2745.92406
The National High School Examination (ENEM) gains more importance each year, as it gradually replaces the traditional vestibular (Brazilian university entrance exam). Many practice exams are assembled almost randomly by teachers or systems, with questions chosen without criteria. With this methodology, if a test needs to be reapplied, it is not possible to recreate it with questions of the same difficulty as those used in the first evaluation. In this context, the present work presents the development of an Intelligent ENEM Practice Exam Generation System that calculates the Item Response Theory (IRT) parameters of questions already applied in the ENEM and, based on them, classifies the questions into difficulty groups, thus enabling the generation of balanced tests. For this, the K-means algorithm was used to group the questions into three difficulty groups: easy, medium, and difficult. To verify the functioning of the system, a 180-question practice exam was generated following the ENEM model; the questions were classified into the exact difficulty group in 37.7% of cases. This hit rate was not higher because the algorithm confused the difficulty of questions belonging to adjacent classes. However, the system reaches a hit rate of 92.8% when classifying questions that belong to distant groups.
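A minimal sketch of the grouping step, assuming the questions are described by three-parameter logistic (3PL) IRT parameters (discrimination a, difficulty b, guessing c); the synthetic values and parameter ranges below are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Each row: assumed 3PL IRT parameters (a, b, c) of one of 180 questions.
items = np.column_stack([
    rng.uniform(0.5, 2.5, 180),   # a: discrimination
    rng.normal(0.0, 1.0, 180),    # b: difficulty
    rng.uniform(0.0, 0.3, 180),   # c: guessing
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(items)

# Order clusters by mean difficulty so labels read easy < medium < difficult.
order = np.argsort([items[kmeans.labels_ == k][:, 1].mean() for k in range(3)])
names = {k: name for k, name in zip(order, ["easy", "medium", "difficult"])}
for k in range(3):
    print(names[k], (kmeans.labels_ == k).sum(), "questions")
```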
{"title":"A Computational Strategy for Classification of Enem Issues Based on Item Response Theory","authors":"G. H. Nunes, B. A. Oliveira, C. Nametala","doi":"10.22456/2175-2745.92406","DOIUrl":"https://doi.org/10.22456/2175-2745.92406","url":null,"abstract":"The National High School Examination (ENEM) gains each year more importance, as it gradually, replacing traditional vestibular. Many simulations are done almost randomly by teachers or systems, with questions chosen without discretion. With this methodology, if a test needs to be reapplied, it is not possible to recreate it with questions that have the same difficulty as those used in the first evaluation. In this context, the present work presents the development of an ENEM Intelligent Simulation Generation System that calculates the parameters of Item Response Theory (TRI) of questions that have already been applied in ENEM and, based on them, classifies them. in groups of difficulty, thus enabling the generation of balanced tests. For this, the K-means algorithm was used to group the questions into three difficulty groups: easy, medium and difficult. To verify the functioning of the system, a simulation with 180 questions was generated along the ENEM model. It could be seen that in 37.7% of cases this happened. This hit rate was not greater because the algorithm confounded the difficulty of issues that are in close classes. However, the system has a hit rate of 92.8% in the classification of questions that are in distant groups.","PeriodicalId":82472,"journal":{"name":"Research initiative, treatment action : RITA","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2020-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81066039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluating the impact of maintenance policies associated to SLA contracts on the dependability of data centers electrical infrastructures
Felipe Fernandes Lima Melo, J. F. D. S. Junior, G. Callou
Pub Date: 2020-01-15 | DOI: 10.22456/2175-2745.88822
Due to the growth of cloud computing, the data center environment has grown in importance and use. Data centers are responsible for maintaining and processing several critical applications. Therefore, data center infrastructures must be evaluated in order to achieve the high availability and reliability demanded of such environments. This work adopts Stochastic Petri Nets (SPN) to evaluate the impact of maintenance policies on data center dependability. The main goal is to analyze maintenance policies associated with SLA contracts and to propose improvements. To accomplish this, an optimization strategy based on the Euclidean distance is adopted to indicate the most appropriate solution under conflicting requirements (e.g., cost and availability). To illustrate the applicability of the proposed models and approach, this work presents case studies comparing different SLA contracts and maintenance policies (preventive and corrective) applied to data center electrical infrastructures.
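A minimal sketch of the Euclidean-distance selection idea, assuming min-max normalization and a utopia point at lowest cost and highest availability; the candidate policies and their numbers are made up.

```python
import numpy as np

# Made-up candidates: (annual maintenance cost in $, steady-state availability).
candidates = {
    "policy A": (120_000.0, 0.99950),
    "policy B": (150_000.0, 0.99991),
    "policy C": (100_000.0, 0.99900),
}

names = list(candidates)
cost, avail = np.array(list(candidates.values())).T

# Min-max normalization puts both conflicting metrics on [0, 1].
c = (cost - cost.min()) / (cost.max() - cost.min())
a = (avail - avail.min()) / (avail.max() - avail.min())

# Distance to the utopia point (normalized cost 0, normalized availability 1).
dist = np.sqrt(c**2 + (1 - a)**2)
print("best trade-off:", names[int(dist.argmin())])
```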
{"title":"Evaluating the impact of maintenance policies associated to SLA contracts on the dependability of data centers electrical infrastructures","authors":"Felipe Fernandes Lima Melo, J. F. D. S. Junior, G. Callou","doi":"10.22456/2175-2745.88822","DOIUrl":"https://doi.org/10.22456/2175-2745.88822","url":null,"abstract":"Due to the growth of cloud computing, data center environment has grown in importance and in use. Data centers are responsible for maintaining and processing several critical-value applications. Therefore, data center infrastructures must be evaluated in order to improve the high availability and reliability demanded for such environments. This work adopts Stochastic Petri Nets (SPN) to evaluate the impact of maintenance policies on the data center dependability. The main goal is to analyze maintenance policies, associated to SLA contracts, and to propose improvements. In order to accomplish this, an optimization strategy that uses Euclidean distance is adopted to indicate the most appropriate solution assuming conflicting requirements (e.g., cost and availability). To illustrate the applicability of the proposed models and approach, this work presents case studies comparing different SLA contracts and maintenance policies (preventive and corrective) applied on data center electrical infrastructures.","PeriodicalId":82472,"journal":{"name":"Research initiative, treatment action : RITA","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2020-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72447612","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Group Labeling Methodology Using Distance-based Data Grouping Algorithms
Francisco das Chagas Imperes Filho, V. Machado, R. Veras, K. Aires, Aline Montenegro Leal Silva
Pub Date: 2020-01-15 | DOI: 10.22456/2175-2745.91414
Clustering algorithms are often used to form groups based on the similarity of their members. In this context, understanding a group is just as important as its composition. Identifying, or labeling, groups can assist in their interpretation and, consequently, guide decision-making efforts by taking into account the features of each group. Interpreting groups can be beneficial when it is necessary to know what makes an element part of a given group, what the main features of a group are, and what the differences and similarities among groups are. This work describes a method for finding relevant features and generating labels for the elements of each group, identifying them uniquely. In this way, our approach solves the problem of finding relevant definitions that can identify groups. The proposed method transforms the standard output of an unsupervised distance-based clustering algorithm into a Pertinence Degree (GP), where each element of the database receives a GP with respect to each formed group. The elements with their GPs are used to formulate ranges of values for their attributes, and such ranges can identify the groups uniquely. The labels produced by this approach achieved an average of 94.83% correct answers on the analyzed databases, allowing a natural interpretation of the generated definitions.
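The abstract does not give the GP formula, so the sketch below uses one plausible choice: inverse-distance normalization of the point-to-centroid distances that a distance-based clusterer (here K-means) already produces.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)),    # two well-separated blobs
               rng.normal(5, 1, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X)
dist = km.transform(X)   # distance of every element to every group centroid

inv = 1.0 / (dist + 1e-9)                    # closer group -> larger pertinence
gp = inv / inv.sum(axis=1, keepdims=True)    # each row sums to 1 across groups

print(gp[0])   # e.g. a GP near [0.95, 0.05]: element 0 belongs to group 0
```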
{"title":"Group Labeling Methodology Using Distance-based Data Grouping Algorithms","authors":"Francisco das Chagas Imperes Filho, V. Machado, R. Veras, K. Aires, Aline Montenegro Leal Silva","doi":"10.22456/2175-2745.91414","DOIUrl":"https://doi.org/10.22456/2175-2745.91414","url":null,"abstract":"Clustering algorithms are often used to form groups based on the similarity of their members. In this context, understanding a group is just as important as its composition. Identifying, or labeling groups can assist with their interpretation and, consequently, guide decision-making efforts by taking into account the features from each group. Interpreting groups can be beneficial when it is necessary to know what makes an element a part of a given group, what are the main features of a group, and what are the differences and similarities among them. This work describes a method for finding relevant features and generate labels for the elements of each group, uniquely identifying them. This way, our approach solves the problem of finding relevant definitions that can identify groups. The proposed method transforms the standard output of an unsupervised distance-based clustering algorithm into a Pertinence Degree (GP), where each element of the database receives a GP concerning each formed group. The elements with their GPs are used to formulate ranges of values for their attributes. Such ranges can identify the groups uniquely. The labels produced by this approach averaged 94.83% of correct answers for the analyzed databases, allowing a natural interpretation of the generated definitions.","PeriodicalId":82472,"journal":{"name":"Research initiative, treatment action : RITA","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2020-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77802522","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PHOC Descriptor Applied for Mammography Classification
G. B. Santos, André Tragancin Filho
Pub Date: 2020-01-15 | DOI: 10.22456/2175-2745.89115
This paper describes experiments with the PHOC (Pyramid Histogram of Color) feature descriptor in terms of its capacity to represent the features present in breast radiographs (mammograms). Patches were taken from regions of digital mammograms representing benign tissue, cancerous tissue, normal tissue, and image background. The motivation is to evaluate the proposal with a view to running it on an inexpensive, ordinary desktop computer in places located far from medical experts. The images were obtained from the DDSM database and processed to produce the feature dataset used for training an Artificial Neural Network. The results were evaluated through analysis of the learning-rate curve and ROC curves; besides these graphical analytical tools, the confusion matrix and other quantitative metrics (TPR, FPR, and accuracy) were also extracted and analyzed. The average accuracy of approximately 0.8, along with the other metrics extracted from the results, demonstrates that the proposal has potential for further development. To the best of our knowledge, PHOC has not previously been reported in the literature for mammography applications such as the one proposed here.
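A minimal sketch of a pyramid histogram descriptor in the spirit of PHOC; the pyramid depth, bin count, and use of plain intensity histograms are assumptions, since the abstract does not specify the descriptor's configuration.

```python
import numpy as np

def pyramid_histogram(patch, levels=2, bins=16):
    """Concatenate per-cell intensity histograms over a spatial pyramid.

    Level 0 is the whole patch; level l splits it into a 2^l x 2^l grid.
    """
    feats = []
    for level in range(levels + 1):
        cells = 2 ** level
        for rows in np.array_split(patch, cells, axis=0):
            for cell in np.array_split(rows, cells, axis=1):
                hist, _ = np.histogram(cell, bins=bins, range=(0, 256))
                feats.append(hist / max(cell.size, 1))  # per-cell normalization
    return np.concatenate(feats)

patch = np.random.randint(0, 256, (64, 64))  # stand-in for a mammogram patch
desc = pyramid_histogram(patch)
print(desc.shape)   # 16 bins * (1 + 4 + 16) cells = (336,)
```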
{"title":"PHOC Descriptor Applied for Mammography Classification","authors":"G. B. Santos, André Tragancin Filho","doi":"10.22456/2175-2745.89115","DOIUrl":"https://doi.org/10.22456/2175-2745.89115","url":null,"abstract":"This paper describes experiments with PHOC (Pyramid Histogram of Color) features descriptor in terms of capacity for representing features presented in breast radiograph (also known as mammography). Patches were taken from regions in digital mammographies, representing benign, cancerous, normal tissues and image’s background. The motivation is to evaluate the proposal in perspective of using it for execution in an inexpensive ordinary desktop computer in places located far from medical experts. The images were obtained from DDSM database and processed producing the feature-dataset used for training an Artificial Neural Network, the results were evaluated by analysis of the learning rate curve and ROC curves, besides these graphical analytical tools the confusion matrix and other quantitative metrics (TPR, FPR and Accuracy) were also extracted and analyzed. The average accuracy ≈ 0 . 8 and the other metrics extracted from results demonstrate that the proposal presents potential for further developments. At the best effort, PHOC was not found in literature for applications in mammographies such as it is proposed here.","PeriodicalId":82472,"journal":{"name":"Research initiative, treatment action : RITA","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2020-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85269356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Online Tree-Based Approach for Mining Non-Stationary High-Speed Data Streams
Agustín Alejandro Ortiz Díaz, Isvani Inocencio Frías Blanco, L. M. Mariño, F. Baldo
Pub Date: 2020-01-15 | DOI: 10.22456/2175-2745.90822
This paper presents a new learning algorithm for inducing decision trees from data streams. In these domains, large amounts of data arrive constantly over time, possibly at high speed. The proposed algorithm uses a top-down induction method for building trees, splitting leaf nodes recursively until none of them can be expanded. The new algorithm combines two split methods in the tree induction. The first method guarantees, with statistical significance, that each chosen split would be the same as the one chosen with infinite examples; by doing so, it aims to ensure that the tree induced online is close to the optimal model. However, this split method often needs too many examples to decide on the best split, which delays the accuracy improvement of the online predictive model. Therefore, a second method is used to split nodes more quickly, speeding up tree growth; it is based on the observation that larger trees can store more information about the training examples and represent more complex concepts. The first split method is also used to correct splits previously suggested by the second one, once there is sufficient evidence. Finally, an additional procedure rebuilds the tree model according to the suggestions made with an adequate level of statistical significance. The proposed algorithm is empirically compared with several well-known algorithms for inducing decision trees from data streams. The tests, on various synthetic and real-world datasets, show that the proposed algorithm is more competitive in terms of accuracy and model size.
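The "same split as with infinite examples" guarantee is typically obtained with the Hoeffding bound, as in VFDT-style stream trees; the abstract does not name the exact bound, so the sketch below assumes it.

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Radius eps such that the observed mean of n samples lies within
    eps of the true mean with probability at least 1 - delta."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(best_gain, second_gain, n, value_range=1.0, delta=1e-6):
    # Split confidently only when the gap between the two best candidate
    # splits exceeds the bound: with infinite data the ranking would agree.
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)

print(should_split(0.30, 0.22, n=500))    # False: eps ~ 0.118 > gap 0.08
print(should_split(0.30, 0.22, n=5000))   # True:  eps ~ 0.037 < gap 0.08
```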
{"title":"An Online Tree-Based Approach for Mining Non-Stationary High-Speed Data Streams","authors":"Agustín Alejandro Ortiz Díaz, Isvani Inocencio Frías Blanco, L. M. Mariño, F. Baldo","doi":"10.22456/2175-2745.90822","DOIUrl":"https://doi.org/10.22456/2175-2745.90822","url":null,"abstract":"This paper presents a new learning algorithm for inducing decision trees from data streams. In these domains, large amounts of data are constantly arriving over time, possibly at high speed. The proposed algorithm uses a top-down induction method for building trees, splitting leaf nodes recursively, until none of them can be expanded. The new algorithm combines two split methods in the tree induction. The first method is able to guarantee, with statistical significance, that each split chosen would be the same as that chosen using infinite examples. By doing so, it aims at ensuring that the tree induced online is close to the optimal model. However, this split method often needs too many examples to make a decision about the best split, which delays the accuracy improvement of the online predictive learning model. Therefore, the second method is used to split nodes more quickly, speeding up the tree growth. The second split method is based on the observation that larger trees are able to store more information about the training examples and to represent more complex concepts. The first split method is also used to correct splits previously suggested by the second one, when it has sufficient evidence. Finally, an additional procedure rebuilds the tree model according to the suggestions made with an adequate level of statistical significance. The proposed algorithm is empirically compared with several well-known induction algorithms for learning decision trees from data streams. In the tests it is possible to observe that the proposed algorithm is more competitive in terms of accuracy and model size using various synthetic and real world datasets.","PeriodicalId":82472,"journal":{"name":"Research initiative, treatment action : RITA","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2020-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75882151","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}