Inferring Sociodemographic Attributes of Wikipedia Editors: State-of-the-art and Implications for Editor Privacy

S. Brückner, F. Lemmerich, M. Strohmaier
Companion Proceedings of the Web Conference 2021
Published: 2021-04-19
DOI: 10.1145/3442442.3452350 (https://doi.org/10.1145/3442442.3452350)
Citations: 2

Abstract

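In this paper, we investigate state-of-the-art machine learning models for inferring sociodemographic attributes of Wikipedia editors from their public profile pages, and the corresponding implications for editor privacy. To build these models, ground truth labels are obtained via different strategies, using publicly disclosed information from editor profile pages. Different embedding techniques are used to derive features from editors' profile texts. In comparative evaluations of different machine learning models, we show that the highest prediction accuracy is obtained for the attribute gender, with precision values of 82% to 91% for women and men respectively, and an averaged F1-score of 0.78. For other attributes such as age group, education, and religion, the classifiers exhibit F1-scores in the range of 0.32 to 0.74, depending on the model class. By using only publicly disclosed information about Wikipedia editors, we highlight issues surrounding editor privacy on Wikipedia and discuss ways to mitigate this problem. We believe our work can help start a conversation about carefully weighing the potential benefits and harms that come with the existence of information-rich, pre-labeled profile pages of Wikipedia editors.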
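The abstract reports per-class precision and an averaged F1-score. As a minimal sketch of how such figures are typically computed, the snippet below derives precision, recall, and F1 per class and then takes an unweighted (macro) average. The labels are toy values invented for illustration, not data from the paper, and macro averaging is an assumption: the abstract does not state which averaging scheme the authors used.

```python
from collections import Counter


def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label seen."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = Counter(zip(y_true, y_pred))  # counts of (true, predicted) pairs
    scores = {}
    for label in labels:
        tp = pairs[(label, label)]
        fp = sum(n for (t, p), n in pairs.items() if p == label and t != label)
        fn = sum(n for (t, p), n in pairs.items() if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(scores):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


# Hypothetical toy labels, NOT taken from the paper.
y_true = ["f", "f", "f", "m", "m", "m", "m", "m"]
y_pred = ["f", "f", "m", "m", "m", "m", "m", "f"]

scores = per_class_scores(y_true, y_pred)
for label, (p, r, f1) in scores.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"macro F1 = {macro_f1(scores):.2f}")
```

Reporting precision per class, as the abstract does for women and men, makes it visible when a classifier performs unevenly across groups, which a single averaged score can hide.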