私营公司的大规模数据可靠吗？对流行数据集中机器生成的商业位置数据的分析

IF 3 2区社会学 Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Social Science Computer Review Pub Date : 2024-04-15 DOI:10.1177/08944393241245390

Nikolitsa Grigoropoulou, Mario L. Small

{"title":"私营公司的大规模数据可靠吗？对流行数据集中机器生成的商业位置数据的分析","authors":"Nikolitsa Grigoropoulou, Mario L. Small","doi":"10.1177/08944393241245390","DOIUrl":null,"url":null,"abstract":"Large-scale data from private companies offer new opportunities to examine topics of scientific and social significance, such as racial inequality, partisan polarization, and activity-based segregation. However, because such data are often generated through automated processes, their accuracy and reliability for social science research remain unclear. The present study examines how quality issues in large-scale data from private companies can afflict the reporting of even ostensibly uncomplicated values. We assess the reliability with which an often-used device tracking data source, SafeGraph, sorted data it acquired on financial institutions into categories, such as banks and payday lenders, based on a standard classification system. We find major classification problems that vary by type of institution, and remarkably high rates of unidentified closures and duplicate records. We suggest that classification problems can affect research based on large-scale private data in four ways: detection, efficiency, validity, and bias. We discuss the implications of our findings, and list a set of problems researchers should consider when using large-scale data from companies.","PeriodicalId":49509,"journal":{"name":"Social Science Computer Review","volume":"29 1","pages":""},"PeriodicalIF":3.0000,"publicationDate":"2024-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Are Large-Scale Data From Private Companies Reliable? An Analysis of Machine-Generated Business Location Data in a Popular Dataset\",\"authors\":\"Nikolitsa Grigoropoulou, Mario L. Small\",\"doi\":\"10.1177/08944393241245390\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Large-scale data from private companies offer new opportunities to examine topics of scientific and social significance, such as racial inequality, partisan polarization, and activity-based segregation. However, because such data are often generated through automated processes, their accuracy and reliability for social science research remain unclear. The present study examines how quality issues in large-scale data from private companies can afflict the reporting of even ostensibly uncomplicated values. We assess the reliability with which an often-used device tracking data source, SafeGraph, sorted data it acquired on financial institutions into categories, such as banks and payday lenders, based on a standard classification system. We find major classification problems that vary by type of institution, and remarkably high rates of unidentified closures and duplicate records. We suggest that classification problems can affect research based on large-scale private data in four ways: detection, efficiency, validity, and bias. We discuss the implications of our findings, and list a set of problems researchers should consider when using large-scale data from companies.\",\"PeriodicalId\":49509,\"journal\":{\"name\":\"Social Science Computer Review\",\"volume\":\"29 1\",\"pages\":\"\"},\"PeriodicalIF\":3.0000,\"publicationDate\":\"2024-04-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Social Science Computer Review\",\"FirstCategoryId\":\"90\",\"ListUrlMain\":\"https://doi.org/10.1177/08944393241245390\",\"RegionNum\":2,\"RegionCategory\":\"社会学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Social Science Computer Review","FirstCategoryId":"90","ListUrlMain":"https://doi.org/10.1177/08944393241245390","RegionNum":2,"RegionCategory":"社会学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}

引用次数: 0

摘要

来自私营公司的大规模数据为研究种族不平等、党派两极分化和基于活动的隔离等具有科学和社会意义的课题提供了新的机会。然而，由于这些数据通常是通过自动化流程生成的，因此它们在社会科学研究中的准确性和可靠性仍不明确。本研究探讨了私营公司大规模数据的质量问题会如何影响表面上并不复杂的数值报告。我们评估了经常使用的设备追踪数据源 SafeGraph 根据标准分类系统将其获得的金融机构数据分类（如银行和发薪日贷款机构）的可靠性。我们发现，不同类型的机构存在不同的重大分类问题，未识别的关闭机构和重复记录的比率也非常高。我们认为，分类问题会从四个方面影响基于大规模私人数据的研究：检测、效率、有效性和偏见。我们讨论了研究结果的影响，并列出了研究人员在使用公司大规模数据时应考虑的一系列问题。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Are Large-Scale Data From Private Companies Reliable? An Analysis of Machine-Generated Business Location Data in a Popular Dataset

Large-scale data from private companies offer new opportunities to examine topics of scientific and social significance, such as racial inequality, partisan polarization, and activity-based segregation. However, because such data are often generated through automated processes, their accuracy and reliability for social science research remain unclear. The present study examines how quality issues in large-scale data from private companies can afflict the reporting of even ostensibly uncomplicated values. We assess the reliability with which an often-used device tracking data source, SafeGraph, sorted data it acquired on financial institutions into categories, such as banks and payday lenders, based on a standard classification system. We find major classification problems that vary by type of institution, and remarkably high rates of unidentified closures and duplicate records. We suggest that classification problems can affect research based on large-scale private data in four ways: detection, efficiency, validity, and bias. We discuss the implications of our findings, and list a set of problems researchers should consider when using large-scale data from companies.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Social Science Computer Review 社会科学-计算机：跨学科应用

CiteScore

9.00

自引率

4.90%

发文量

审稿时长

>12 weeks

期刊介绍： Unique Scope Social Science Computer Review is an interdisciplinary journal covering social science instructional and research applications of computing, as well as societal impacts of informational technology. Topics included: artificial intelligence, business, computational social science theory, computer-assisted survey research, computer-based qualitative analysis, computer simulation, economic modeling, electronic modeling, electronic publishing, geographic information systems, instrumentation and research tools, public administration, social impacts of computing and telecommunications, software evaluation, world-wide web resources for social scientists. Interdisciplinary Nature Because the Uses and impacts of computing are interdisciplinary, so is Social Science Computer Review. The journal is of direct relevance to scholars and scientists in a wide variety of disciplines. In its pages you''ll find work in the following areas: sociology, anthropology, political science, economics, psychology, computer literacy, computer applications, and methodology.