AStA Wirtschafts- und Sozialstatistisches Archiv: latest articles


Connecting algorithmic fairness to quality dimensions in machine learning in official statistics and survey production

AStA Wirtschafts- und Sozialstatistisches Archiv

Pub Date : 2024-10-07 DOI: 10.1007/s11943-024-00344-2

Patrick Oliver Schenk, Christoph Kern

National Statistical Organizations (NSOs) increasingly draw on Machine Learning (ML) to improve the timeliness and cost-effectiveness of their products. When introducing ML solutions, NSOs must ensure that high standards with respect to robustness, reproducibility, and accuracy are upheld as codified, e.g., in the Quality Framework for Statistical Algorithms (QF4SA; Yung et al. 2022, Statistical Journal of the IAOS). At the same time, a growing body of research focuses on fairness as a pre-condition of a safe deployment of ML to prevent disparate social impacts in practice. However, fairness has not yet been explicitly discussed as a quality aspect in the context of the application of ML at NSOs. We employ the QF4SA quality framework and present a mapping of its quality dimensions to algorithmic fairness. We thereby extend the QF4SA framework in several ways: First, we investigate the interaction of fairness with each of these quality dimensions. Second, we argue for fairness as its own, additional quality dimension, beyond what is contained in the QF4SA so far. Third, we emphasize and explicitly address data, both on its own and its interaction with applied methodology. In parallel with empirical illustrations, we show how our mapping can contribute to methodology in the domains of official statistics, algorithmic fairness, and trustworthy machine learning.

Little to no prior knowledge of ML, fairness, and quality dimensions in official statistics is required as we provide introductions to these subjects. These introductions are also targeted to the discussion of quality dimensions and fairness.


Vol. 18 (2), pp. 131-184. PDF: https://link.springer.com/content/pdf/10.1007/s11943-024-00344-2.pdf

Citations: 0

Automated Bayesian variable selection methods for binary regression models with missing covariate data

AStA Wirtschafts- und Sozialstatistisches Archiv

Pub Date : 2024-09-13 DOI: 10.1007/s11943-024-00345-1

Michael Bergrab, Christian Aßmann

Data collection and the availability of large data sets have increased over the last decades. In both statistical and machine learning frameworks, two methodological issues typically arise when performing regression analysis on large data sets. First, variable selection is crucial in regression modeling, as it helps to identify an appropriate model with respect to the considered set of conditioning variables. Second, especially in the context of survey data, handling missing values, which occur even with state-of-the-art data collection and processing methods, is important for estimation. In this paper, we provide a Bayesian approach based on a spike-and-slab prior for the regression coefficients, which allows for simultaneous handling of variable selection and estimation in combination with handling of missing values in covariate data. The paper also discusses the implementation of the approach using Markov chain Monte Carlo techniques and provides results for simulated data sets and an empirical illustration based on data from the German National Educational Panel Study. The suggested Bayesian approach is compared to other statistical and machine learning frameworks such as Lasso, ridge regression, and elastic net, and is shown to perform well in terms of estimation performance and variable selection accuracy. The simulation results demonstrate that ignoring missing values in data sets can lead to biased selection results. Overall, the proposed Bayesian method offers a holistic, flexible, and powerful framework for variable selection in the presence of missing covariate data.
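As a rough illustration of the spike-and-slab idea mentioned in the abstract, the sketch below draws regression coefficients from a basic textbook form of such a prior: each coefficient is exactly zero (the "spike") with some probability and otherwise comes from a wide normal distribution (the "slab"). This is not the authors' specification, which the abstract does not detail; the function name `sample_spike_and_slab` and the parameters `inclusion_prob` and `slab_sd` are hypothetical and chosen only for this example.

```python
import random


def sample_spike_and_slab(n_coef, inclusion_prob=0.3, slab_sd=2.0, seed=None):
    """Draw coefficients from a basic spike-and-slab prior (illustrative only).

    Each coefficient is set to exactly 0.0 (spike) with probability
    1 - inclusion_prob; otherwise it is drawn from a zero-mean normal
    distribution with standard deviation slab_sd (slab).
    """
    rng = random.Random(seed)
    coefs = []
    for _ in range(n_coef):
        included = rng.random() < inclusion_prob  # latent inclusion indicator
        coefs.append(rng.gauss(0.0, slab_sd) if included else 0.0)
    return coefs


if __name__ == "__main__":
    # Coefficients that come out as 0.0 correspond to "deselected" variables.
    print(sample_spike_and_slab(10, seed=1))
```

In a full analysis the inclusion indicators would be updated together with the coefficients, for example by Markov chain Monte Carlo as the abstract describes; the sketch covers only a draw from the prior.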


Vol. 18 (2), pp. 203-244. PDF: https://link.springer.com/content/pdf/10.1007/s11943-024-00345-1.pdf

Citations: 0

Die Volkswirtschaftlichen Gesamtrechnungen in Zeiten der Pandemie – wurden alle Herausforderungen gemeistert? (National accounts in times of the pandemic: were all challenges mastered?)

AStA Wirtschafts- und Sozialstatistisches Archiv

Pub Date : 2024-09-05 DOI: 10.1007/s11943-024-00348-y

Josef Richter

The coronavirus pandemic confronted official statistics, and national accounts in particular, with enormous tasks. New, previously unobserved phenomena had to be integrated into the system, and work had to proceed with missing and altered data sources. At the same time, the legitimate information needs of the general public and of decision-makers had to be met under difficult conditions. In addition, fundamental conceptual questions that can usually be neglected became pressing. It had to be decided whether prices can exist when no transactions take place, and it had to be clarified which concept of production is actually to be operationalized.

The system of national accounts contains both a technical and an economic concept of production, as is shown in more detail using selected provisions. Under normal conditions, both approaches yield similar results. As the article illustrates, however, larger differences can arise under the special circumstances of the pandemic.

Under the pressure of events, a very pragmatic approach was chosen during the pandemic, and the failure to address central conceptual questions was also excused on the grounds that the effects on the major aggregates are small. For the monitoring task and for the dominant operational functions of the data, this is certainly true. However, the national accounts also have an important role to play as an empirical basis for economic research. In this context, the conceptual questions would have deserved more attention. The challenge of informing users adequately was also insufficiently met. In presenting the results, the distinct characteristics of the results for the pandemic periods, which stem from the specific situation, were mostly left out.


Vol. 18 (3-4), pp. 305-318. PDF: https://link.springer.com/content/pdf/10.1007/s11943-024-00348-y.pdf

Citations: 0

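Fairness als Qualitätskriterium im Maschinellen Lernen – Rekonstruktion des philosophischen Konzepts und Implikationen für die Nutzung außergesetzlicher Merkmale bei qualifizierten Mietspiegeln (Fairness as a quality criterion in machine learning: reconstruction of the philosophical concept and implications for the use of extra-legal features in qualified rent indices)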

AStA Wirtschafts- und Sozialstatistisches Archiv

Pub Date : 2024-08-23 DOI: 10.1007/s11943-024-00346-0

Ludwig Bothmann, Kristina Peters

With the increasing use of machine learning (ML) models within automated decision-making systems, the demands on the quality of ML models are growing. Predictive performance alone is no longer the sole quality criterion; in particular, there are increasing calls for fairness aspects to be taken into account. This article pursues two goals. First, it summarizes the current fairness discussion in ML (fairML) and describes the most recent developments, in particular regarding the philosophical foundations of the concept of fairness within ML. Second, it addresses the question of to what extent so-called "extra-legal" features (außergesetzliche Merkmale) may be used in the construction of qualified rent indices (qualifizierte Mietspiegel). A recent proposal by Kauermann and Windmann (AStA Wirtschafts- und Sozialstatistisches Archiv, Volume 17, 2023) for using extra-legal features in qualified rent indices involves a model-based imputation method, which we compare with the legal requirements. Finally, we show which alternatives from the field of fairML could be used and set out the different basic philosophical assumptions behind the various methods.


Vol. 18 (2), pp. 185-201. PDF: https://link.springer.com/content/pdf/10.1007/s11943-024-00346-0.pdf

Citations: 0

Interview mit Walter Krämer (Interview with Walter Krämer)

AStA Wirtschafts- und Sozialstatistisches Archiv

Pub Date : 2024-07-18 DOI: 10.1007/s11943-024-00343-3

Ulrich Rendtel

Citations: 0

„Mister SOEP et al.“ – ein Nachruf auf Gert G. Wagner ("Mister SOEP et al.": an obituary for Gert G. Wagner)

AStA Wirtschafts- und Sozialstatistisches Archiv

Pub Date : 2024-07-09 DOI: 10.1007/s11943-024-00342-4

C. Katharina Spieß

Citations: 0

Data Observer—a guide to data that can help to inform evidence-based policymaking

AStA Wirtschafts- und Sozialstatistisches Archiv

Pub Date : 2024-06-24 DOI: 10.1007/s11943-024-00341-5

Joachim Wagner

For many attempts to inform evidence-based policymaking (or policy-makers in general) researchers have to rely on already available (instead of newly collected) data. These data have to be reliable, accessible (at best, without high hurdles, and with low or no fees to be paid) and findable. One way that helps to find suitable data that are easily accessible (and hopefully reliable) is to look at the contributions published in the Data Observer series described in this paper.


Citations: 0

Flat rent price prediction in Berlin with web scraping

AStA Wirtschafts- und Sozialstatistisches Archiv

Pub Date : 2024-06-24 DOI: 10.1007/s11943-024-00340-6

Camilo Meyberg, Ulrich Rendtel, Holger Leerhoff

Internet data pose a challenge to the traditional system of official statistics, which relies on more conventional sources such as surveys and registers, not readily adaptable to rapid changes. Expanding this system to include internet data is currently at an experimental stage, exploring these sources’ potentials and benefits. This paper describes a project conducted within the ESSnet Trusted Smart Statistics – Web Intelligence Network framework. It investigates the use of online apartment listings to analyze the rental market. We used web scraping to extract information from two online real estate portals for flats in the city of Berlin. Using this data, we developed a model to predict rental prices per square meter based on the accommodation’s features and location within the city. We detected offers which appear in both portals by means of statistical matching and removed duplicate offers. Missing values were treated by multiple imputation. The prediction model is a semi-parametric approach where the postal districts are used to describe the location effect. Comparisons with microcensus results and the local rent index reveal significant differences between the market of online flat offers and the stock of existing flat contracts. Interested readers will find the commented programming code in the internet supplement.


Vol. 18 (2), pp. 245-278. PDF: https://link.springer.com/content/pdf/10.1007/s11943-024-00340-6.pdf

Citations: 0

Vorwort der Herausgeber (Editors' preface)

AStA Wirtschafts- und Sozialstatistisches Archiv

Pub Date : 2024-04-17 DOI: 10.1007/s11943-024-00339-z

Markus Zwick, Jan Pablo Burgard

Citations: 0

Interview mit Ralf Münnich (Interview with Ralf Münnich)

AStA Wirtschafts- und Sozialstatistisches Archiv

Pub Date : 2024-03-11 DOI: 10.1007/s11943-024-00337-1

Walter Krämer

Citations: 0
