Proceedings of the Fifth Workshop on Natural Language Processing and Computational Social Science (NLP+CSS): Latest articles

OLALA: Object-Level Active Learning for Efficient Document Layout Annotation

Pub Date : 2020-10-05 DOI: 10.18653/v1/2022.nlpcss-1.19

Zejiang Shen, Jian Zhao, Melissa Dell, Yaoliang Yu, Weining Li

Layout detection is an essential step for accurately extracting structured contents from historical documents. The intricate and varied layouts present in these document images make it expensive to label the numerous layout regions that can be densely arranged on each page. Current active learning methods typically rank and label samples at the image level, where the annotation budget is not optimally spent due to the overexposure of common objects per image. Inspired by recent progress in semi-supervised learning and self-training, we propose OLALA, an Object-Level Active Learning framework for efficient document layout Annotation. OLALA aims to optimize the annotation process by selectively annotating only the most ambiguous regions within an image, while using automatically generated labels for the rest. Central to OLALA is a perturbation-based scoring function that determines which objects require manual annotation. Extensive experiments show that OLALA can significantly boost model performance and improve annotation efficiency, facilitating the extraction of masses of structured text for downstream NLP applications.
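
The object-level selection described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not define the perturbation-based score, so the sketch takes per-object ambiguity scores as given, and the function name `split_by_ambiguity`, the object ids, and the `budget` parameter are all hypothetical.

```python
from typing import Dict, List, Tuple


def split_by_ambiguity(
    scores: Dict[str, float], budget: int
) -> Tuple[List[str], List[str]]:
    """Split predicted objects into (manual, automatic) label sets.

    `scores` maps a hypothetical object id to an ambiguity score, where a
    higher value is assumed to mean a less stable prediction.
    """
    # Rank from most to least ambiguous; break ties by id for determinism.
    ranked = sorted(scores, key=lambda obj_id: (-scores[obj_id], obj_id))
    manual = ranked[:budget]      # sent to a human annotator
    automatic = ranked[budget:]   # keep the model's own predicted labels
    return manual, automatic


# Toy scores for the objects detected on one page (made-up values).
page_scores = {"title": 0.05, "table": 0.62, "footnote": 0.41, "paragraph": 0.08}
manual, automatic = split_by_ambiguity(page_scores, budget=2)
print(manual)     # ['table', 'footnote']
print(automatic)  # ['paragraph', 'title']
```

Only the two most ambiguous objects consume annotation budget here; the rest keep their automatically generated labels, which is the contrast with image-level selection that the abstract draws.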

Citations: 4

An Analysis of Acknowledgments in NLP Conference Proceedings

Pub Date : 1900-01-01 DOI: 10.18653/v1/2022.nlpcss-1.17

Winston Wu

While acknowledgments are often overlooked and sometimes entirely missing from publications, this short section of a paper can provide insights on the state of a field. We characterize and perform a textual analysis of acknowledgments in NLP conference proceedings across the last 17 years, revealing broader trends in funding and research directions in NLP as well as interesting phenomena including career incentives and the influence of defaults.

Citations: 0

Utilizing Weak Supervision to Create S3D: A Sarcasm Annotated Dataset

Pub Date : 1900-01-01 DOI: 10.18653/v1/2022.nlpcss-1.22

Jordan Painter, H. Treharne, Diptesh Kanojia

Sarcasm is prevalent in all corners of social media, posing many challenges within Natural Language Processing (NLP), particularly for sentiment analysis. Sarcasm detection remains a largely unsolved problem in many NLP tasks due to its contradictory and typically derogatory nature as a figurative language construct. With recent strides in NLP, many pre-trained language models exist that have been trained on data from specific social media platforms, e.g., Twitter. In this paper, we evaluate the efficacy of multiple sarcasm detection datasets using machine and deep learning models. We create two new datasets: a manually annotated gold-standard Sarcasm Annotated Dataset (SAD) and a Silver-Standard Sarcasm-annotated Dataset (S3D). Using a combination of existing sarcasm datasets with SAD, we train a sarcasm detection model over a social-media domain pre-trained language model, BERTweet, which yields an F1-score of 78.29%. Using an Ensemble model with an underlying majority technique, we further label S3D to produce a weakly supervised dataset containing over 100,000 tweets. We publicly release all the code, our manually annotated and weakly supervised datasets, and fine-tuned models for further research.
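
The weak-labelling step can be illustrated with a short sketch. It assumes that the "underlying majority technique" means plain majority voting over the binary predictions of several models; the abstract does not say how many models vote or how ties are handled, so the tie rule, the function name `majority_label`, and the tweet ids below are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional


def majority_label(votes: List[int]) -> Optional[int]:
    """Return the label most models agree on, or None for a tie or no votes."""
    if not votes:
        return None
    counts = Counter(votes).most_common()
    # A tie between the two most frequent labels gives no silver label.
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


# Toy predictions (1 = sarcastic, 0 = not sarcastic) from hypothetical models.
model_predictions: Dict[str, List[int]] = {
    "tweet_1": [1, 1, 0],
    "tweet_2": [0, 0, 0],
    "tweet_3": [1, 0],
}
silver = {tid: majority_label(v) for tid, v in model_predictions.items()}
print(silver)  # {'tweet_1': 1, 'tweet_2': 0, 'tweet_3': None}
```

Dropping tied items rather than guessing is one possible design choice for a silver-standard set; a real pipeline might instead use an odd number of voters or weight them by validation score.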

Citations: 1

To Prefer or to Choose? Generating Agency and Power Counterfactuals Jointly for Gender Bias Mitigation

Pub Date : 1900-01-01 DOI: 10.18653/v1/2022.nlpcss-1.6

Maja Stahl, Maximilian Spliethöver, Henning Wachsmuth

Gender bias may emerge from an unequal representation of agency and power, for example, by portraying women frequently as passive and powerless (“She accepted her future”) and men as proactive and powerful (“He chose his future”). When language models learn from respective texts, they may reproduce or even amplify the bias. An effective way to mitigate bias is to generate counterfactual sentences with agency and power opposite to those in the training data. Recent work targeted agency-specific verbs from a lexicon to this end. We argue that this is insufficient, due to the interaction of agency and power and their dependence on context. In this paper, we thus develop a new rewriting model that identifies verbs with the desired agency and power in the context of the given sentence. The verbs’ probability is then boosted to encourage the model to rewrite both connotations jointly. According to automatic metrics, our model effectively controls for power while being competitive in agency to the state of the art. In our main evaluation, human annotators favored its counterfactuals in terms of both connotations, also deeming its meaning preservation better.

Citations: 2

Fine-Grained Extraction and Classification of Skill Requirements in German-Speaking Job Ads

Pub Date : 1900-01-01 DOI: 10.18653/v1/2022.nlpcss-1.2

A. Gnehm, Eva Bühlmann, Helen Buchs, S. Clematide

Monitoring the development of labor market skill requirements is an information need that is increasingly addressed by applying text mining methods to job advertisement data. We present an approach for fine-grained extraction and classification of skill requirements from German-speaking job advertisements. We adapt pre-trained transformer-based language models to the domain and task of computing meaningful representations of sentences or spans. By using context from job advertisements and the large ESCO domain ontology, we improve our similarity-based unsupervised multi-label classification results. Our best model achieves a mean average precision of 0.969 on the skill class level.
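
For readers unfamiliar with the reported metric, the sketch below computes mean average precision under its standard textbook definition. The abstract does not describe the authors' evaluation code, so this is an assumption about how such a figure is typically obtained; the function names and the toy class labels are hypothetical.

```python
from typing import List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average of precision@k taken at each rank k that holds a relevant label."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked, start=1):
        if label in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(
    rankings: List[List[str]], gold: List[Set[str]]
) -> float:
    """Mean of the per-item average precision scores."""
    scores = [average_precision(r, g) for r, g in zip(rankings, gold)]
    return sum(scores) / len(scores) if scores else 0.0


# Two toy skill spans, each with classes ranked by similarity (made-up labels).
rankings = [["a", "b", "c"], ["x", "y"]]
gold = [{"a", "c"}, {"y"}]
map_score = mean_average_precision(rankings, gold)
print(round(map_score, 4))  # 0.6667
```

In the first toy item the relevant classes sit at ranks 1 and 3, giving (1/1 + 2/3) / 2; in the second the single relevant class sits at rank 2, giving 1/2. The mean of the two is 2/3.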

Citations: 2

Detecting Dissonant Stance in Social Media: The Role of Topic Exposure

Pub Date : 1900-01-01 DOI: 10.18653/v1/2022.nlpcss-1.16

Vasudha Varadarajan, Nikita Soni, Weixi Wang, C. Luhmann, H. A. Schwartz, Naoya Inoue

We address dissonant stance detection, classifying conflicting stance between two input statements. Computational models for traditional stance detection have typically been trained to indicate pro/con for a given target topic (e.g. gun control) and thus do not generalize well to new topics. In this paper, we systematically evaluate the generalizability of dissonant stance detection to situations where examples of the topic have not been seen at all or have only been seen a few times. We show that dissonant stance detection models trained on only 8 topics, none of which are the target topic, can perform as well as those trained only on a target topic. Further, adding non-target topics boosts performance further up to approximately 32 topics where accuracies start to plateau. Taken together, our experiments suggest dissonant stance detection models can generalize to new unanticipated topics, an important attribute for the social scientific study of social media where new topics emerge daily.

Citations: 2

Conspiracy Narratives in the Protest Movement Against COVID-19 Restrictions in Germany. A Long-term Content Analysis of Telegram Chat Groups.

Pub Date : 1900-01-01 DOI: 10.18653/v1/2022.nlpcss-1.8

Manuel Weigand, Maximilian Weber, Johannes B. Gruber

From the start of the COVID-19 pandemic in Germany, different groups have been protesting measures implemented by different government bodies in Germany to control the pandemic. It was widely claimed that many of the offline and online protests were driven by conspiracy narratives disseminated through groups and channels on the messenger app Telegram. We investigate this claim by measuring the frequency of conspiracy narratives in messages from open Telegram chat groups of the Querdenken movement, set up to organize protests against COVID-19 restrictions in Germany. We furthermore explore the content of these messages using topic modelling. To this end, we collected 822k text messages sent between April 2020 and May 2022 in 34 chat groups. By fine-tuning a DistilBERT model, using self-annotated data, we find that 8.24% of the sent messages contain signs of conspiracy narratives. This number is not static, however, as the share of conspiracy messages grew while the overall number of messages shows a downward trend since its peak at the end of 2020. We further find that a mix of known conspiracy narratives makes up the topics in our topic model. Our findings suggest that the Querdenken movement is getting smaller over time, but its remaining members focus even more on conspiracy narratives.
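
The distinction between the share of conspiracy messages and the overall message volume can be made concrete with a small sketch. It assumes each message already carries a month stamp and a binary classifier output; the function name `monthly_share` and the toy records are hypothetical and are not drawn from the paper's data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def monthly_share(messages: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Fraction of messages flagged as conspiracy-related, per month."""
    totals: Dict[str, int] = defaultdict(int)
    flagged: Dict[str, int] = defaultdict(int)
    for month, is_conspiracy in messages:
        totals[month] += 1
        flagged[month] += int(is_conspiracy)
    # Sort by month string so the result reads chronologically.
    return {month: flagged[month] / totals[month] for month in sorted(totals)}


# Toy records: (month, classifier flag). Values are invented for illustration.
toy_messages = [
    ("2020-12", True), ("2020-12", False), ("2020-12", False), ("2020-12", False),
    ("2022-05", True), ("2022-05", False),
]
shares = monthly_share(toy_messages)
print(shares)  # {'2020-12': 0.25, '2022-05': 0.5}
```

In the toy data the volume halves from four messages to two while the flagged share doubles, which is the kind of pattern the abstract reports: fewer messages overall, but a larger proportion containing conspiracy narratives.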

Citations: 0

Understanding Narratives from Demographic Survey Data: a Comparative Study with Multiple Neural Topic Models

Pub Date : 1900-01-01 DOI: 10.18653/v1/2022.nlpcss-1.4

Xiao Xu, Gert Stulp, Antal van den Bosch, Anne Gauthier

Fertility intentions as verbalized in surveys are a poor predictor of actual fertility outcomes, i.e., the number of children people have. This can partly be explained by the uncertainty people have in their intentions. Such uncertainties are hard to capture through traditional survey questions, although open-ended questions can be used to get insight into people’s subjective narratives of the future that determine their intentions. Analyzing such answers to open-ended questions can be done through Natural Language Processing techniques. Traditional topic models (e.g., LSA and LDA), however, often fail to do so, since they rely on co-occurrences, which are often rare in short survey responses. The aim of this study was to apply and evaluate topic models on demographic survey data. In this study, we applied neural topic models (e.g. BERTopic, CombinedTM) based on language models to responses from Dutch women on their fertility plans, and compared the topics and their coherence scores from each model to expert judgments. Our results show that neural models produce topics more in line with human interpretation compared to LDA. However, the coherence score could only partly reflect on this, depending on the corpus used for calculation. This research is important because, first, it helps us develop more informed strategies on model selection and evaluation for topic modeling on survey data; and second, it shows that the field of demography has much to gain from adopting NLP methods.

Citations: 0

Conditional Language Models for Community-Level Linguistic Variation

Pub Date : 1900-01-01 DOI: 10.18653/v1/2022.nlpcss-1.9

Bill Noble, Jean-Philippe Bernardy

Community-level linguistic variation is a core concept in sociolinguistics. In this paper, we use conditioned neural language models to learn vector representations for 510 online communities. We use these representations to measure linguistic variation between communities and investigate the degree to which linguistic variation corresponds with social connections between communities. We find that our sociolinguistic embeddings are highly correlated with a social network-based representation that does not use any linguistic input.
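
One way to compare two sets of community representations is to correlate their pairwise similarities, sketched below. The abstract does not name the similarity measure or the correlation statistic, so cosine similarity and Pearson correlation are assumptions here, and the three toy communities and their vectors are invented for illustration.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_similarities(embeddings: Dict[str, Sequence[float]]) -> List[float]:
    """Cosine similarity for every unordered community pair, in sorted-name order."""
    names = sorted(embeddings)
    return [cosine(embeddings[a], embeddings[b]) for a, b in combinations(names, 2)]


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy embeddings: communities "a" and "b" are close in both spaces, "c" is not.
linguistic = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
social = {"a": [1.0, 0.0, 0.0], "b": [1.0, 0.2, 0.0], "c": [0.0, 0.0, 1.0]}

ling_sims = pairwise_similarities(linguistic)
soc_sims = pairwise_similarities(social)
r = pearson(ling_sims, soc_sims)
print(f"correlation between similarity structures: {r:.3f}")
```

Because the two embedding spaces may have different dimensions, the comparison runs on pairwise similarities rather than on the vectors themselves; both dictionaries must cover the same community names so that the pairs line up.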

Citations: 0

Can Contextualizing User Embeddings Improve Sarcasm and Hate Speech Detection?

Pub Date : 1900-01-01 DOI: 10.18653/v1/2022.nlpcss-1.14

Kim Breitwieser

While implicit embeddings so far have been mostly concerned with creating an overall representation of the user, we evaluate a different approach. By only considering content directed at a specific topic, we create sub-user embeddings, and measure their usefulness on the tasks of sarcasm and hate speech detection. In doing so, we show that task-related topics can have a noticeable effect on model performance, especially when dealing with intended expressions like sarcasm, but less so for hate speech, which is usually labelled as such on the receiving end.

Citations: 1
