2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA): Latest Papers
Detection Model of Depression Based on Eye Movement Trajectory
Yifang Yuan, Qingxiang Wang
The eye movement trajectories of depressed patients differ from those of normal persons. The eye-tracking data obtained by an eye tracker can adequately summarize the characteristics of the eye movement trajectory. Based on these characteristics, this paper proposes a new depression detection model that uses an artificial neural network and can better assist doctors in the diagnosis of depression. First, we extract features of the eye movement trajectory from time-series data recording the trajectory of the eye. Then, we convert the data from three-dimensional to two-dimensional and perform feature extraction and transformation. Finally, we propose a new depression detection model based on artificial neural networks. The experimental results show that the best result of the model evaluation is 83.17%, which can effectively assist doctors in the diagnosis of depression.
DOI: https://doi.org/10.1109/dsaa.2019.00082
Published: 2019-10-01
Citations: 3
A Rademacher Complexity Based Method for Controlling Power and Confidence Level in Adaptive Statistical Analysis
L. Stefani, E. Upfal
While standard statistical inference techniques and machine learning generalization bounds assume that tests are run on data selected independently of the hypotheses, practical data analysis and machine learning are usually iterative and adaptive processes in which the same holdout data is often used for testing a sequence of hypotheses (or models), each of which may depend on the outcome of the previous tests on the same data. In this work, we present RADABOUND, a rigorous, efficient and practical procedure for controlling the generalization error when using a holdout sample for multiple adaptive testing. Our solution is based on a new application of the Rademacher Complexity generalization bounds, adapted to dependent tests. We demonstrate the statistical power and practicality of our method through extensive simulations and comparisons to alternative approaches. In particular, we show that our rigorous solution is substantially more powerful and efficient than the differential-privacy-based approach proposed in Dwork et al. [1]–[3].
DOI: https://doi.org/10.1109/DSAA.2019.00021
Published: 2019-10-01
Citations: 6
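As a rough illustration of the quantity behind the RADABOUND entry above, the following Python sketch estimates the empirical Rademacher complexity of a finite set of functions by Monte Carlo sampling. It is not the authors' procedure: the function name `empirical_rademacher`, its parameters, and the restriction to a finite class are assumptions made for illustration only.

```python
import random


def empirical_rademacher(function_values, n_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity.

    function_values[k][i] is the value of the k-th function (for example,
    the loss of the k-th hypothesis) on the i-th holdout sample.
    """
    rng = random.Random(seed)
    n = len(function_values[0])
    total = 0.0
    for _ in range(n_draws):
        # One vector of independent random signs, each +1 or -1.
        sigma = [rng.choice((-1, 1)) for _ in range(n)]
        # Largest sign-weighted average over the functions in the class.
        total += max(
            sum(s * v for s, v in zip(sigma, values)) / n
            for values in function_values
        )
    return total / n_draws


if __name__ == "__main__":
    losses = [[1.0, 0.0, 1.0, 1.0], [0.0, 1.0, 0.0, 1.0]]
    print(empirical_rademacher(losses))
```

For a fixed seed, adding functions to the class can only increase the estimate, because the maximum is taken over more candidates for each sign draw.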
Comparison of Variable Selection Methods for Forecasting from Short Time Series
M. McGee, R. Yaffee
Forecasting from multivariate time series data is a difficult task, made more so when the number of series (p) is much larger than the length of each series (T), which makes dimension reduction desirable prior to obtaining a model. The LASSO has become a widely used method for choosing relevant covariates out of many candidates, and it has many variations and extensions, such as grouped LASSO, adaptive LASSO, weighted lag adaptive LASSO, and fused LASSO. Of these, only the weighted lag adaptive LASSO and the fused LASSO take into account natural ordering among series. To examine the ability of variations on the LASSO to choose relevant covariates for short time series, we use simulations for series with fewer than 50 observations. We then apply the methods to a data set on significant changes in self-reported psycho-social symptoms in the 30 years after the Chornobyl nuclear catastrophe.
DOI: https://doi.org/10.1109/DSAA.2019.00068
Published: 2019-10-01
Citations: 2
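To make the variable selection idea in the entry above concrete, here is a minimal Python sketch of the plain LASSO solved by cyclic coordinate descent. It does not implement the grouped, adaptive, weighted lag adaptive, or fused variants the abstract compares, and the function names and the objective scaling (1/(2n)) are assumptions chosen for illustration.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimise (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by cyclic updates."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            rho = 0.0  # covariance of column j with the partial residual
            z = 0.0    # mean squared value of column j
            for i in range(n):
                pred_without_j = sum(X[i][k] * b[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            rho /= n
            z /= n
            # Coefficients whose signal is below lam are set exactly to zero.
            b[j] = soft_threshold(rho, lam) / z if z > 0 else 0.0
    return b


if __name__ == "__main__":
    X = [[1.0, 0.0], [0.0, 1.0]]
    y = [3.0, 0.1]
    print(lasso_coordinate_descent(X, y, lam=0.5))
```

The exact zeros produced by the soft-thresholding step are what make the LASSO usable as a covariate selector when p is much larger than T.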
A Study on the Impact of Data Characteristics in Imbalanced Regression Tasks
Paula Branco, L. Torgo
The class imbalance problem has been thoroughly studied over the past two decades. More recently, the research community realized that the problem of imbalanced distributions also occurs in tasks beyond classification. Regression problems are among these newly studied tasks, where the problem of imbalanced domains also poses important challenges. Imbalanced regression problems occur in a diversity of real-world domains such as meteorological (predicting extreme weather values), financial (forecasting extreme stock returns) or medical (anticipating rare values). In imbalanced regression the end-user preferences are biased towards values of the target variable that are under-represented in the available data. Several pre-processing methods have been proposed to address this problem. These methods change the training set to force the learner to focus on the rare cases. However, as far as we know, the relationship between the intrinsic characteristics of the data and the performance achieved by these methods has not yet been studied for imbalanced regression tasks. In this paper we describe a study of the impact certain data characteristics may have on the results of applying pre-processing methods to imbalanced regression problems. To achieve this goal, we define potentially interesting data characteristics of regression problems. We then conduct our study using a synthetic data repository built for this purpose. We show that each of the characteristics studied behaves differently, depending on the level at which the data characteristic is present and on the learning algorithm used.
The main contributions of our work are: i) to define interesting data characteristics for regression tasks; ii) to create the first repository of imbalanced regression tasks containing 6000 data sets with controlled data characteristics; and iii) to provide insights on the impact of intrinsic data characteristics in the results of pre-processing methods for handling imbalanced regression tasks.
DOI: https://doi.org/10.1109/DSAA.2019.00034
Published: 2019-10-01
Citations: 2
Martensite Start Temperature Predictor for Steels Using Ensemble Data Mining
Ankit Agrawal, A. Saboo, W. Xiong, G. Olson, A. Choudhary
Martensite start temperature (MsT) is an important characteristic of steels, knowledge of which is vital for materials engineers to guide the structural design process of steels. It is defined as the highest temperature at which the austenite phase in steel begins to transform to martensite phase during rapid cooling. Here we describe the development and deployment of predictive models for MsT, given the chemical composition of the material. The data-driven models described here are built on a dataset of about 1000 experimental observations reported in published literature, and the best model developed was found to significantly outperform several existing MsT prediction methods. The data-driven analyses also revealed several interesting insights about the relationship between MsT and the constituent alloying elements of steels. The most accurate predictive model resulting from this work has been deployed in an online web-tool that takes as input the elemental alloying composition of a given steel and predicts its MsT. The online MsT predictor is available at http://info.eecs.northwestern.edu/MsTpredictor.
DOI: https://doi.org/10.1109/DSAA.2019.00067
Published: 2019-10-01
Citations: 5
Grade Prediction with Neural Collaborative Filtering
Zhiyun Ren, Xia Ning, Andrew S. Lan, H. Rangwala
Over the past decade, low graduation and retention rates have plagued higher education institutions. To assist students in choosing a sequence of courses, choosing majors and finding successful academic pathways, many institutions provide several on-site academic advising services supported by data-driven educational technologies. Accurate performance prediction can serve as the backbone for degree planning software, personalized advising systems and early warning systems that can identify students at risk of dropping out of their field of study. In this work, we present a deep learning based recommender system approach called Neural Collaborative Filtering (NCF) for predicting the grade a student will earn in a course that he/she plans to take in the next term. Prior grade prediction methods are based on matrix factorization (MF), where students and courses are represented in a latent "knowledge" space. The deep learning inspired approach provides added flexibility in learning the latent spaces in comparison to MF approaches. The proposed approach also incorporates instructor information in addition to student and course information. Moreover, for proper analysis of the learned model parameters, we assume the embeddings obtained for students, courses and instructors should be non-negative. This non-negative NCF model, referred to as the NCFnn model, adds a rectified linear unit (ReLU) to the embedding layer of NCF. The experimental results on datasets from George Mason University, a large, public university in the United States, demonstrate that the proposed NCF approaches significantly outperform competitive baselines across different test sets.
DOI: https://doi.org/10.1109/DSAA.2019.00014
Published: 2019-10-01
Citations: 12
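The non-negative embedding idea in the entry above can be sketched as a small forward pass in Python. Only the use of a ReLU on the student, course and instructor embeddings comes from the abstract; the concatenation, the single hidden layer, and all names (`predict_grade`, `dense`, `hidden`, `output`) are assumptions for illustration and not the authors' architecture.

```python
def relu(vector):
    """Element-wise rectified linear unit."""
    return [max(0.0, x) for x in vector]


def dense(vector, weights, bias):
    """One fully connected layer; weights holds one row per output unit."""
    return [
        sum(w * x for w, x in zip(row, vector)) + b
        for row, b in zip(weights, bias)
    ]


def predict_grade(student_emb, course_emb, instructor_emb, hidden, output):
    """Predict a grade from three embeddings (hypothetical layer layout)."""
    # ReLU on the embedding layer keeps every latent factor non-negative.
    features = relu(student_emb) + relu(course_emb) + relu(instructor_emb)
    h = relu(dense(features, *hidden))
    return dense(h, *output)[0]


if __name__ == "__main__":
    hidden = ([[1.0] * 6], [0.0])   # (weights, bias) of the hidden layer
    output = ([[0.5]], [1.0])       # (weights, bias) of the output layer
    print(predict_grade([1.0, -2.0], [0.5, 0.5], [-1.0, 3.0], hidden, output))
```

Because negative embedding entries are clipped to zero before they reach the hidden layer, each surviving entry can be read as the strength of a latent factor rather than a signed direction.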
On Analysing Supply and Demand in Labor Markets: Framework, Model and System
H. S. Sugiarto, Ee-Peng Lim, Ngak-Leng Sim
The labor market refers to the market between job seekers and employers. As many job-seeking and talent-hiring activities are now performed online, a large amount of job posting and application data has been collected and can be re-purposed for labor market analysis. In the labor market, supply and demand are the key factors in determining an appropriate salary for both job applicants and employers. However, it is challenging to discover the supply and demand for any labor market. In this paper, we propose a novel framework to build a labor market model using a large amount of job post and applicant data. For each labor market, supply and demand are constructed from the offer salaries of job posts and the responses of applicants. The equilibrium salary and the equilibrium job quantity are calculated by considering the supply and demand. This labor market modeling framework is then applied to a large job repository dataset containing job post and applicant data from Singapore, a developed economy in Southeast Asia. Several issues are discussed thoroughly in the paper, including developing and evaluating salary prediction models to predict missing offer salaries and estimate reserved salaries. Moreover, we propose a way to empirically evaluate the equilibrium salary of the proposed model. The constructed labor market models are then used to explain the challenges specific to job seekers and employers in various market segments. We also report gender and age biases that exist in labor markets. Finally, we present a wage dashboard system that yields interesting salary insights using the model.
DOI: https://doi.org/10.1109/DSAA.2019.00066
Published: 2019-10-01
Citations: 5
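The equilibrium computation mentioned in the entry above can be illustrated with a small Python sketch. The abstract does not specify how the curves are built, so the definitions used here are assumptions: demand at a salary is the number of job posts offering at least that salary, supply is the number of applicants whose reserved salary is at most that salary, and the equilibrium is the lowest salary at which supply reaches demand. The function name `equilibrium` and the sample figures are hypothetical.

```python
def equilibrium(offer_salaries, reserved_salaries):
    """Return (salary, quantity) where supply first meets demand.

    Returns None if supply never reaches demand at any observed salary.
    """
    candidates = sorted(set(offer_salaries) | set(reserved_salaries))
    for salary in candidates:
        # Employers still willing to hire at this salary.
        demand = sum(1 for offer in offer_salaries if offer >= salary)
        # Applicants willing to work at this salary.
        supply = sum(1 for reserve in reserved_salaries if reserve <= salary)
        if supply >= demand:
            return salary, min(supply, demand)
    return None


if __name__ == "__main__":
    offers = [3000, 3500, 4000, 4500]      # hypothetical offer salaries
    reserves = [2500, 3200, 3800, 4200]    # hypothetical reserved salaries
    print(equilibrium(offers, reserves))
```

Scanning only the observed salary values is enough here, because both step curves change only at those points.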
Using Machine Learning to Predict High School Student Employability – A Case Study
Aarushi Dubey, M. Mani
In this paper, we explore the use of supervised machine learning models to predict the employability of high school students with local businesses for part-time jobs. We further compare the performance of the trained models used in this analysis with one another. Empirical results show that it is possible to predict the employability of high school students with local businesses with high predictive accuracy. The trained predictive models perform better with larger datasets, reaching up to 93% accuracy.
DOI: https://doi.org/10.1109/DSAA.2019.00078
Published: 2019-10-01
Citations: 6
Truth Discovery from Multi-Sourced Text Data Based on Ant Colony Optimization
Chen Chang, Jianjun Cao, Guojun Lv, Nianfeng Weng
In the era of information explosion, plenty of data has been generated through a variety of channels, such as social networks, crowdsourcing platforms and blogs. Conflicts and errors are constantly emerging. Truth discovery aims to find trustworthy information in conflicting data by considering source reliability. However, most traditional truth discovery approaches are designed only for structured data and fail to meet the strong requirement of extracting trustworthy information from unstructured raw text data. The major challenges of inferring reliable information from text data stem from the multifactorial property (i.e., an answer may contain multiple different key factors, which may be complex) and the diversity of word usage (i.e., different words may share similar semantic information while being spelled completely differently). To address these challenges, an ant colony optimization based text data truth discovery model is proposed. First, keywords extracted from all answers to a specific question are grouped into a set. Then, we translate the truth discovery problem into a subset optimization problem, and parallel ant colony optimization is used to find the correct keywords for each question from the full keyword set, based on the truth discovery hypothesis. After that, the answers to each question can be ranked by the similarity between the keywords of user answers and the correct keywords identified by the colony. The experimental results on a real dataset show that, even when the semantic information of text data is complex, our proposed model can still find trustworthy information in complex answers, in comparison with retrieval-based and state-of-the-art approaches.
DOI: https://doi.org/10.1109/DSAA.2019.00031 (published 2019-10-01)
Citations: 0
Topology-Based Clusterwise Regression for User Segmentation and Demand Forecasting
Rodrigo Rivera-Castro, A. Pletnev, Polina Pilyugina, G. Diaz, I. Nazarov, Wanyi Zhu, E. Burnaev
Topological Data Analysis (TDA) is a recent approach to analyzing data sets from the perspective of their topological structure. Its use for time series data has been limited. This work presents a system, developed for a leading provider of cloud computing, that combines user segmentation and demand forecasting. It consists of a TDA-based clustering method for time series, inspired by a popular managerial framework for customer segmentation, which is extended to clusterwise regression using matrix factorization methods to forecast demand. Increasing customer loyalty and producing accurate forecasts remain active topics of discussion for both researchers and managers. Using a public data set and a novel proprietary data set of commercial data, this research shows that the proposed system enables analysts both to cluster their user base and to plan demand at a granular level with significantly higher accuracy than a state-of-the-art baseline. This work thus seeks to introduce TDA-based clustering of time series and clusterwise regression with matrix factorization methods as viable tools for the practitioner.
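The idea of clusterwise regression mentioned in this abstract, fitting one regression per cluster while the cluster assignments themselves are chosen by regression fit, can be sketched as follows. This does not implement the paper's system: it substitutes a mean-level initialization for the TDA-based clustering and a per-cluster linear trend for the matrix factorization forecaster, and every function and parameter name is hypothetical.

```python
def fit_line(points):
    """Ordinary least squares for y = a + b*t over (t, y) pairs."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in points)
    cov = sum((t - mean_t) * (y - mean_y) for t, y in points)
    b = cov / var_t if var_t else 0.0
    return mean_y - b * mean_t, b


def sse(series, line):
    """Sum of squared errors of one series against a cluster's line."""
    a, b = line
    return sum((y - (a + b * t)) ** 2 for t, y in enumerate(series))


def fit_clusters(series_list, labels, k):
    """Fit one trend line per cluster from all points of its member series."""
    lines = []
    for c in range(k):
        pts = [(t, y)
               for s, lab in zip(series_list, labels) if lab == c
               for t, y in enumerate(s)]
        lines.append(fit_line(pts) if pts else (0.0, 0.0))
    return lines


def clusterwise_regression(series_list, k=2, n_iters=20):
    """Alternate between refitting lines and reassigning series to the
    cluster whose line fits them best (assumed initialization: mean level)."""
    n = len(series_list)
    order = sorted(range(n), key=lambda i: sum(series_list[i]) / len(series_list[i]))
    labels = [0] * n
    for rank, i in enumerate(order):
        labels[i] = rank * k // n
    for _ in range(n_iters):
        lines = fit_clusters(series_list, labels, k)
        new_labels = [min(range(k), key=lambda c: sse(s, lines[c]))
                      for s in series_list]
        if new_labels == labels:
            break
        labels = new_labels
    return labels, fit_clusters(series_list, labels, k)


def forecast(line, t):
    """Predict the value at time index t from a fitted (intercept, slope)."""
    a, b = line
    return a + b * t
```

For example, `clusterwise_regression([[0, 1, 2, 3], [10, 9, 8, 7]], k=2)` separates the rising series from the falling one, and `forecast` then extrapolates each cluster's trend one step ahead.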
DOI: https://doi.org/10.1109/DSAA.2019.00048 (published 2019-10-01)
Citations: 4