{"title":"Analyzing data streams for social scientists","authors":"Lianne Ippel, M. Kaptein, J. Vermunt","doi":"10.4324/9781003025245-6","DOIUrl":null,"url":null,"abstract":"The technological developments of the last decades have created opportunities to efficiently collect data on many individuals over time. While these technologies provide exciting research opportunities, they also pose challenges: datasets collected using these technologies grow increasingly large or are continuously augmented with new observations. These data streams make the standard computation of well-known estimators inefficient, as computations are repeated each time new data enter. This chapter details online learning, an analysis method that updates parameter estimates instead of re-estimating them to analyze large and/or streaming data. The chapter presents several simple (and exact) examples of online estimation for independent observations. Additionally, social scientists are often faced with nested data: pupils are nested within schools, or repeated measurements are nested within individuals. Nested data are typically analyzed using multilevel models. Estimating multilevel models, however, can be challenging in data streams: the standard algorithms used to fit these models repeatedly revisit all data points, which becomes infeasible in a data stream context. We present a solution to this problem by introducing the Streaming Expectation Maximization Approximation (SEMA) algorithm for fitting multilevel models online. 
We end this chapter with a discussion of the methodological challenges that remain.","PeriodicalId":422456,"journal":{"name":"Handbook of Computational Social Science, Volume 2","volume":"46 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Handbook of Computational Social Science, Volume 2","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.4324/9781003025245-6","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citation count: 0
Abstract
The technological developments of the last decades have created opportunities to efficiently collect data on many individuals over time. While these technologies provide exciting research opportunities, they also pose challenges: datasets collected using these technologies grow increasingly large or are continuously augmented with new observations. These data streams make the standard computation of well-known estimators inefficient, as computations are repeated each time new data enter. This chapter details online learning, an analysis method that updates parameter estimates instead of re-estimating them to analyze large and/or streaming data. The chapter presents several simple (and exact) examples of online estimation for independent observations. Additionally, social scientists are often faced with nested data: pupils are nested within schools, or repeated measurements are nested within individuals. Nested data are typically analyzed using multilevel models. Estimating multilevel models, however, can be challenging in data streams: the standard algorithms used to fit these models repeatedly revisit all data points, which becomes infeasible in a data stream context. We present a solution to this problem by introducing the Streaming Expectation Maximization Approximation (SEMA) algorithm for fitting multilevel models online. We end this chapter with a discussion of the methodological challenges that remain.
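To make the idea of updating estimates instead of re-estimating them concrete, the following is a minimal sketch of an exact online update for the sample mean and variance of independent observations. It is not taken from the chapter: the class name `OnlineMoments`, its fields, and the example stream are illustrative assumptions, and the variance update uses the standard Welford recursion. It does not attempt to illustrate SEMA.

```python
from dataclasses import dataclass


@dataclass
class OnlineMoments:
    """Running count, mean, and sum of squared deviations for a data stream.

    Hypothetical helper for illustration only; not the chapter's own code.
    """

    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        """Fold one new observation into the estimates without revisiting old data."""
        self.n += 1
        delta = x - self.mean          # deviation from the previous mean
        self.mean += delta / self.n    # updated mean
        self.m2 += delta * (x - self.mean)  # Welford update of squared deviations

    @property
    def variance(self) -> float:
        """Unbiased sample variance; nan when fewer than two points have arrived."""
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")


# Made-up stream of independent observations, processed one at a time.
stream = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
est = OnlineMoments()
for x in stream:
    est.update(x)
```

Each call to `update` touches only the three stored summaries, so the cost per new observation is constant and the raw data need not be kept in memory. After the whole stream has passed, the estimates match those computed offline from the full dataset, up to floating-point rounding.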