Analyzing data streams for social scientists

Lianne Ippel, M. Kaptein, J. Vermunt
Handbook of Computational Social Science, Volume 2. Published 2021-11-10. DOI: 10.4324/9781003025245-6 (https://doi.org/10.4324/9781003025245-6)

Abstract

The technological developments of the last decades have created opportunities to efficiently collect data on many individuals over time. While these technologies provide exciting research opportunities, they also pose challenges: datasets collected using these technologies grow increasingly large or are continuously augmented with new observations. Such data streams make the standard computation of well-known estimators inefficient, because the computations are repeated each time new data enter. This chapter details online learning, an approach to analyzing large and/or streaming data that updates parameter estimates instead of re-estimating them from scratch. The chapter presents several simple (and exact) examples of online estimation for independent observations. Additionally, social scientists are often faced with nested data: pupils are nested within schools, or repeated measurements are nested within individuals. Nested data are typically analyzed using multilevel models. Estimating multilevel models, however, can be challenging in data streams: the standard algorithms used to fit these models repeatedly revisit all data points, which becomes infeasible in a data stream context. We present a solution to this problem by introducing the Streaming Expectation Maximization Approximation (SEMA) algorithm for fitting multilevel models online. We end this chapter with a discussion of the methodological challenges that remain.
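To make the idea of updating rather than re-estimating concrete, the sketch below shows one exact online estimator for independent observations: a running mean and sample variance. It is an illustration only and is not taken from the chapter. The class name OnlineMeanVar, the use of Welford's update, and the example stream are all assumptions made for this sketch.

```python
class OnlineMeanVar:
    """Running mean and sample variance, updated one observation at a time.

    Hypothetical helper for illustration; uses Welford's update.
    """

    def __init__(self):
        self.n = 0        # number of observations seen so far
        self.mean = 0.0   # current estimate of the mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def update(self, x):
        # Only the three summary statistics change; earlier data points
        # are never revisited, so the cost per observation is constant.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance (n - 1 denominator); undefined below two points.
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")


# Made-up stream of independent observations arriving one by one.
stream = [2.0, 4.0, 4.0, 5.0, 7.0]
est = OnlineMeanVar()
for x in stream:
    est.update(x)
```

After the loop, est.mean and est.variance match, up to floating-point rounding, the values a batch computation over the full list would give, even though the list was never stored or re-read by the estimator. Multilevel models do not admit such a simple exact update, which is the gap the abstract says SEMA is meant to address.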