Predicting Good Configurations for GitHub and Stack Overflow Topic Models

Christoph Treude, Markus Wagner
{"title":"Predicting Good Configurations for GitHub and Stack Overflow Topic Models","authors":"Christoph Treude, Markus Wagner","doi":"10.1109/MSR.2019.00022","DOIUrl":null,"url":null,"abstract":"Software repositories contain large amounts of textual data, ranging from source code comments and issue descriptions to questions, answers, and comments on Stack Overflow. To make sense of this textual data, topic modelling is frequently used as a text-mining tool for the discovery of hidden semantic structures in text bodies. Latent Dirichlet allocation (LDA) is a commonly used topic model that aims to explain the structure of a corpus by grouping texts. LDA requires multiple parameters to work well, and there are only rough and sometimes conflicting guidelines available on how these parameters should be set. In this paper, we contribute (i) a broad study of parameters to arrive at good local optima for GitHub and Stack Overflow text corpora, (ii) an a-posteriori characterisation of text corpora related to eight programming languages, and (iii) an analysis of corpus feature importance via per-corpus LDA configuration. We find that (1) popular rules of thumb for topic modelling parameter configuration are not applicable to the corpora used in our experiments, (2) corpora sampled from GitHub and Stack Overflow have different characteristics and require different configurations to achieve good model fit, and (3) we can predict good configurations for unseen corpora reliably. These findings support researchers and practitioners in efficiently determining suitable configurations for topic modelling when analysing textual data contained in software repositories.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"75 1","pages":"84-95"},"PeriodicalIF":0.0000,"publicationDate":"2018-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"38","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MSR.2019.00022","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 38

Abstract

Software repositories contain large amounts of textual data, ranging from source code comments and issue descriptions to questions, answers, and comments on Stack Overflow. To make sense of this textual data, topic modelling is frequently used as a text-mining tool for the discovery of hidden semantic structures in text bodies. Latent Dirichlet allocation (LDA) is a commonly used topic model that aims to explain the structure of a corpus by grouping texts. LDA requires multiple parameters to work well, and there are only rough and sometimes conflicting guidelines available on how these parameters should be set. In this paper, we contribute (i) a broad study of parameters to arrive at good local optima for GitHub and Stack Overflow text corpora, (ii) an a-posteriori characterisation of text corpora related to eight programming languages, and (iii) an analysis of corpus feature importance via per-corpus LDA configuration. We find that (1) popular rules of thumb for topic modelling parameter configuration are not applicable to the corpora used in our experiments, (2) corpora sampled from GitHub and Stack Overflow have different characteristics and require different configurations to achieve good model fit, and (3) we can predict good configurations for unseen corpora reliably. These findings support researchers and practitioners in efficiently determining suitable configurations for topic modelling when analysing textual data contained in software repositories.
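To make the parameter space the abstract refers to concrete: in a typical LDA implementation the main knobs are the number of topics and the Dirichlet priors over the document-topic distribution (alpha) and the topic-word distribution (eta, sometimes written beta). The sketch below uses gensim's LdaModel and topic coherence as a proxy for model fit; the library choice, the toy corpus, and the specific parameter values are illustrative assumptions, not the experimental setup used in the paper.

```python
# Minimal sketch (assumption: gensim as the LDA toolkit; the paper's own
# experiments may use a different implementation and search procedure).
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

# Toy stand-in for a GitHub/Stack Overflow corpus: token lists per document.
docs = [
    ["null", "pointer", "exception", "java", "stacktrace"],
    ["git", "merge", "conflict", "branch", "rebase"],
    ["python", "list", "comprehension", "loop", "performance"],
    ["java", "exception", "try", "catch", "finally"],
]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# The parameters the paper studies: the number of topics and the Dirichlet
# priors alpha (document-topic) and eta (topic-word).
lda = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=2,     # rules of thumb for this value often transfer poorly
    alpha="auto",     # learn an asymmetric document-topic prior from the data
    eta="auto",       # learn the topic-word prior as well
    passes=10,
    random_state=42,
)

# One common proxy for "good model fit": topic coherence (higher is better).
coherence = CoherenceModel(
    model=lda, texts=docs, dictionary=dictionary, coherence="c_v"
).get_coherence()
print(f"c_v coherence: {coherence:.3f}")
```

Finding a good per-corpus configuration would wrap the LdaModel call in a search over num_topics, alpha, and eta; the paper's point is that such good settings can be predicted for unseen corpora instead of being re-searched from scratch each time.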