Detection Model of Depression Based on Eye Movement Trajectory
Yifang Yuan, Qingxiang Wang
Eye movement trajectories differ between depressed patients and healthy individuals, and the data recorded by an eye tracker can adequately capture the characteristics of these trajectories. Based on these characteristics, this paper proposes a new depression detection model built on an artificial neural network, which can better assist doctors in diagnosing depression. First, we extract eye movement trajectory features, which are obtained from the time-series data recording the path of the eye. Then, we convert the data from three dimensions to two and perform feature extraction and transformation. Finally, we train the detection model with an artificial neural network. Experimental results show a best evaluation score of 83.17%, indicating that the model can effectively assist doctors in the diagnosis of depression.
DOI: 10.1109/dsaa.2019.00082
A Rademacher Complexity Based Method for Controlling Power and Confidence Level in Adaptive Statistical Analysis
L. Stefani, E. Upfal
While standard statistical inference techniques and machine learning generalization bounds assume that tests are run on data selected independently of the hypotheses, practical data analysis and machine learning are usually iterative and adaptive processes in which the same holdout data is often used to test a sequence of hypotheses (or models), each of which may depend on the outcome of previous tests on the same data. In this work, we present RADABOUND, a rigorous, efficient, and practical procedure for controlling the generalization error when using a holdout sample for multiple adaptive testing. Our solution is based on a new application of Rademacher complexity generalization bounds, adapted to dependent tests. We demonstrate the statistical power and practicality of our method through extensive simulations and comparisons to alternative approaches. In particular, we show that our rigorous solution is substantially more powerful and efficient than the differential-privacy-based approach proposed by Dwork et al. [1]–[3].
DOI: 10.1109/DSAA.2019.00021
Comparison of Variable Selection Methods for Forecasting from Short Time Series
M. McGee, R. Yaffee
Forecasting from multivariate time series data is a difficult task, made more so when the number of series (p) is much larger than the length of each series (T), which makes dimension reduction desirable before fitting a model. The LASSO has become a widely used method for choosing relevant covariates out of many candidates, and it has many variations and extensions, such as the grouped LASSO, adaptive LASSO, weighted lag adaptive LASSO, and fused LASSO. Of these, only the weighted lag adaptive LASSO and the fused LASSO take into account the natural ordering among series. To examine the ability of LASSO variants to choose relevant covariates for short time series, we run simulations on series with fewer than 50 observations. We then apply the methods to a data set on significant changes in self-reported psycho-social symptoms in the 30 years after the Chornobyl nuclear catastrophe.
DOI: 10.1109/DSAA.2019.00068
A Study on the Impact of Data Characteristics in Imbalanced Regression Tasks
Paula Branco, L. Torgo
The class imbalance problem has been thoroughly studied over the past two decades. More recently, the research community realized that imbalanced distributions also occur in tasks beyond classification. Regression is among these newly studied tasks in which imbalanced domains pose important challenges. Imbalanced regression problems occur in a diversity of real-world domains such as meteorology (predicting extreme weather values), finance (forecasting extreme stock returns), or medicine (anticipating rare values). In imbalanced regression, the end-user preferences are biased towards values of the target variable that are under-represented in the available data. Several pre-processing methods have been proposed to address this problem. These methods change the training set to force the learner to focus on the rare cases. However, as far as we know, the relationship between intrinsic data characteristics and the performance achieved by these methods has not yet been studied for imbalanced regression tasks. In this paper we describe a study of the impact certain data characteristics may have on the results of applying pre-processing methods to imbalanced regression problems. To achieve this goal, we define potentially interesting data characteristics of regression problems. We then conduct our study using a synthetic data repository built for this purpose. We show that each of the studied characteristics behaves differently, in a way that depends on the level at which the characteristic is present and on the learning algorithm used. The main contributions of our work are: i) defining interesting data characteristics for regression tasks; ii) creating the first repository of imbalanced regression tasks, containing 6000 data sets with controlled data characteristics; and iii) providing insights on the impact of intrinsic data characteristics on the results of pre-processing methods for handling imbalanced regression tasks.
DOI: 10.1109/DSAA.2019.00034
Martensite Start Temperature Predictor for Steels Using Ensemble Data Mining
Ankit Agrawal, A. Saboo, W. Xiong, G. Olson, A. Choudhary
Martensite start temperature (MsT) is an important characteristic of steels, knowledge of which is vital for materials engineers guiding the structural design of steels. It is defined as the highest temperature at which the austenite phase in a steel begins to transform to the martensite phase during rapid cooling. Here we describe the development and deployment of predictive models for MsT given the chemical composition of the material. The data-driven models described here are built on a dataset of about 1000 experimental observations reported in the published literature, and the best model developed was found to significantly outperform several existing MsT prediction methods. The data-driven analyses also revealed several interesting insights about the relationship between MsT and the constituent alloying elements of steels. The most accurate predictive model resulting from this work has been deployed in an online web tool that takes as input the elemental alloying composition of a given steel and predicts its MsT. The online MsT predictor is available at http://info.eecs.northwestern.edu/MsTpredictor.
{"title":"Martensite Start Temperature Predictor for Steels Using Ensemble Data Mining","authors":"Ankit Agrawal, A. Saboo, W. Xiong, G. Olson, A. Choudhary","doi":"10.1109/DSAA.2019.00067","DOIUrl":"https://doi.org/10.1109/DSAA.2019.00067","url":null,"abstract":"Martensite start temperature (MsT) is an important characteristic of steels, knowledge of which is vital for materials engineers to guide the structural design process of steels. It is defined as the highest temperature at which the austenite phase in steel begins to transform to martensite phase during rapid cooling. Here we describe the development and deployment of predictive models for MsT, given the chemical composition of the material. The data-driven models described here are built on a dataset of about 1000 experimental observations reported in published literature, and the best model developed was found to significantly outperform several existing MsT prediction methods. The data-driven analyses also revealed several interesting insights about the relationship between MsT and the constituent alloying elements of steels. The most accurate predictive model resulting from this work has been deployed in an online web-tool that takes as input the elemental alloying composition of a given steel and predicts its MsT. The online MsT predictor is available at http://info.eecs.northwestern.edu/MsTpredictor.","PeriodicalId":416037,"journal":{"name":"2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124931034","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Grade Prediction with Neural Collaborative Filtering
Zhiyun Ren, Xia Ning, Andrew S. Lan, H. Rangwala
Over the past decade, low graduation and retention rates have plagued higher education institutions. To assist students in choosing course sequences, majors, and successful academic pathways, many institutions provide on-site academic advising services supported by data-driven educational technologies. Accurate performance prediction can serve as the backbone for degree planning software, personalized advising systems, and early warning systems that identify students at risk of dropping out of their field of study. In this work, we present a deep-learning-based recommender system approach called Neural Collaborative Filtering (NCF) for predicting the grade a student will earn in a course that he or she plans to take in the next term. Prior grade prediction methods are based on matrix factorization (MF), in which students and courses are represented in a latent "knowledge" space. The deep-learning-inspired approach provides added flexibility in learning the latent spaces compared with MF approaches. The proposed approach also incorporates instructor information alongside student and course information. Moreover, for proper analysis of the learned model parameters, we constrain the embeddings obtained for students, courses, and instructors to be non-negative. This non-negative NCF model, referred to as NCFnn, applies a rectified linear unit (ReLU) to the embedding layer of NCF. Experimental results on datasets from George Mason University, a large public university in the United States, demonstrate that the proposed NCF approaches significantly outperform competitive baselines across different test sets.
DOI: 10.1109/DSAA.2019.00014
On Analysing Supply and Demand in Labor Markets: Framework, Model and System
H. S. Sugiarto, Ee-Peng Lim, Ngak-Leng Sim
The labor market is the market between job seekers and employers. As much job seeking and talent hiring now takes place online, a large amount of job posting and application data has been collected and can be re-purposed for labor market analysis. In a labor market, supply and demand are the key factors in determining an appropriate salary for both job applicants and employers. However, it is challenging to discover the supply and demand of any labor market. In this paper, we propose a novel framework to build a labor market model from a large amount of job post and applicant data. For each labor market, the supply and demand are constructed from the offer salaries of job posts and the responses of applicants, and the equilibrium salary and equilibrium job quantity are calculated by considering both. This labor market modeling framework is then applied to a large job repository containing job post and applicant data from Singapore, a developed economy in Southeast Asia. Several issues are discussed thoroughly in the paper, including developing and evaluating salary prediction models to predict missing offer salaries and to estimate reserved salaries. Moreover, we propose a way to empirically evaluate the equilibrium salary of the proposed model. The constructed labor market models are then used to explain the job-seeker-specific and employer-specific challenges in various market segments. We also report gender and age biases that exist in labor markets. Finally, we present a wage dashboard system that yields interesting salary insights using the model.
DOI: 10.1109/DSAA.2019.00066
Using Machine Learning to Predict High School Student Employability – A Case Study
Aarushi Dubey, M. Mani
In this paper, we explore the use of supervised machine learning models to predict the employability of high school students with local businesses for part-time jobs. We further compare the performance of the trained models against one another. Empirical results show that it is possible to predict the employability of high school students with local businesses with high predictive accuracy. The trained models perform better with larger datasets, reaching up to 93% accuracy.
DOI: 10.1109/DSAA.2019.00078
Truth Discovery from Multi-Sourced Text Data Based on Ant Colony Optimization
Chen Chang, Jianjun Cao, Guojun Lv, Nianfeng Weng
In the era of information explosion, vast amounts of data are generated through a variety of channels, such as social networks, crowdsourcing platforms, and blogs, and conflicts and errors constantly emerge. Truth discovery aims to find trustworthy information in conflicting data by considering source reliability. However, most traditional truth discovery approaches are designed only for structured data and fail to meet the strong requirement of extracting trustworthy information from unstructured raw text. The major challenges of inferring reliable information from text data stem from its multifactorial property (an answer may contain multiple different key factors, which may be complex) and the diversity of word usage (different words may share similar semantic information while being spelled completely differently). To address these challenges, we propose a truth discovery model for text data based on ant colony optimization. First, keywords extracted from all answers to a given question are grouped into a set. Then, the truth discovery problem is translated into a subset optimization problem, and parallel ant colony optimization is used to find the correct keywords for each question from the full keyword set, based on the truth discovery hypothesis. After that, the answers to each question can be ranked by the similarity between the keywords of each user answer and the correct keywords identified by the colony. Experimental results on a real dataset show that even when the semantic information of the text is complex, the proposed model can still find trustworthy information in complex answers, outperforming retrieval-based and state-of-the-art approaches.
DOI: 10.1109/DSAA.2019.00031
Topology-Based Clusterwise Regression for User Segmentation and Demand Forecasting
Rodrigo Rivera-Castro, A. Pletnev, Polina Pilyugina, G. Diaz, I. Nazarov, Wanyi Zhu, E. Burnaev
Topological Data Analysis (TDA) is a recent approach to analyzing data sets from the perspective of their topological structure, and its use for time series data has been limited. In this work, we present a system developed for a leading cloud computing provider that combines user segmentation with demand forecasting. It consists of a TDA-based clustering method for time series, inspired by a popular managerial framework for customer segmentation and extended to clusterwise regression using matrix factorization methods to forecast demand. Increasing customer loyalty and producing accurate forecasts remain active topics of discussion for both researchers and managers. Using a public data set and a novel proprietary set of commercial data, this research shows that the proposed system enables analysts both to cluster their user base and to plan demand at a granular level, with significantly higher accuracy than a state-of-the-art baseline. This work thus seeks to introduce TDA-based clustering of time series and clusterwise regression with matrix factorization methods as viable tools for the practitioner.
{"title":"Topology-Based Clusterwise Regression for User Segmentation and Demand Forecasting","authors":"Rodrigo Rivera-Castro, A. Pletnev, Polina Pilyugina, G. Diaz, I. Nazarov, Wanyi Zhu, E. Burnaev","doi":"10.1109/DSAA.2019.00048","DOIUrl":"https://doi.org/10.1109/DSAA.2019.00048","url":null,"abstract":"Topological Data Analysis (TDA) is a recent approach to analyze data sets from the perspective of their topological structure. Its use for time series data has been limited. In this work, a system developed for a leading provider of cloud computing combining both user segmentation and demand forecasting is presented. It consists of a TDA-based clustering method for time series inspired by a popular managerial framework for customer segmentation and extended to the case of clusterwise regression using matrix factorization methods to forecast demand. Increasing customer loyalty and producing accurate forecasts remain active topics of discussion both for researchers and managers. Using a public and a novel proprietary data set of commercial data, this research shows that the proposed system enables analysts to both cluster their user base and plan demand at a granular level with significantly higher accuracy than a state of the art baseline. This work thus seeks to introduce TDA-based clustering of time series and clusterwise regression with matrix factorization methods as viable tools for the practitioner.","PeriodicalId":416037,"journal":{"name":"2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127778711","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}