Analyzing data streams for social scientists

Handbook of Computational Social Science, Volume 2 | Pub Date: 2021-11-10 | DOI: 10.4324/9781003025245-6

Lianne Ippel, M. Kaptein, J. Vermunt



Abstract

The technological developments of the last decades have created opportunities to efficiently collect data on many individuals over time. While these technologies provide exciting research opportunities, they also pose challenges: datasets collected using these technologies grow increasingly large or are continuously augmented with new observations. Such data streams make the standard computation of well-known estimators inefficient, as computations are repeated each time new data enter. This chapter details online learning, an analysis method for large and/or streaming data that updates parameter estimates as new observations arrive instead of re-estimating them from scratch. The chapter presents several simple (and exact) examples of online estimation for independent observations. Additionally, social scientists are often faced with nested data: pupils are nested within schools, or repeated measurements are nested within individuals. Nested data are typically analyzed using multilevel models. Estimating multilevel models, however, can be challenging in data streams: the standard algorithms used to fit these models repeatedly revisit all data points, which becomes infeasible in a data stream context. We present a solution to this problem by introducing the Streaming Expectation Maximization Approximation (SEMA) algorithm for fitting multilevel models online. We end this chapter with a discussion of the methodological challenges that remain.
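To make the idea of updating rather than re-estimating concrete, the following is a minimal sketch of an exact online estimator for independent observations: a running mean and sample variance computed with Welford's update. It is an illustration only and is not taken from the chapter; the class name `OnlineMeanVariance`, its fields, and the example data are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class OnlineMeanVariance:
    """Running mean and sample variance, updated one observation at a time.

    Hypothetical helper for illustration; not the chapter's own code.
    """
    n: int = 0          # number of observations seen so far
    mean: float = 0.0   # current estimate of the mean
    m2: float = 0.0     # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        # Welford's update: only the summary statistics are stored,
        # so earlier data points never need to be revisited.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance (n - 1 denominator); undefined below two points.
        if self.n < 2:
            return float("nan")
        return self.m2 / (self.n - 1)


# Made-up data standing in for a stream of incoming observations.
stream = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
est = OnlineMeanVariance()
for x in stream:
    est.update(x)
```

Each call to `update` costs constant time and memory regardless of how many observations have already arrived, and the resulting mean and variance match what a batch computation over the full data would give, up to floating-point rounding. Multilevel models do not admit such a simple exact update, which is the gap the SEMA algorithm described in the abstract is meant to address.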


