Predicting Good Configurations for GitHub and Stack Overflow Topic Models

Christoph Treude, Markus Wagner
{"title":"Predicting Good Configurations for GitHub and Stack Overflow Topic Models","authors":"Christoph Treude, Markus Wagner","doi":"10.1109/MSR.2019.00022","DOIUrl":null,"url":null,"abstract":"Software repositories contain large amounts of textual data, ranging from source code comments and issue descriptions to questions, answers, and comments on Stack Overflow. To make sense of this textual data, topic modelling is frequently used as a text-mining tool for the discovery of hidden semantic structures in text bodies. Latent Dirichlet allocation (LDA) is a commonly used topic model that aims to explain the structure of a corpus by grouping texts. LDA requires multiple parameters to work well, and there are only rough and sometimes conflicting guidelines available on how these parameters should be set. In this paper, we contribute (i) a broad study of parameters to arrive at good local optima for GitHub and Stack Overflow text corpora, (ii) an a-posteriori characterisation of text corpora related to eight programming languages, and (iii) an analysis of corpus feature importance via per-corpus LDA configuration. We find that (1) popular rules of thumb for topic modelling parameter configuration are not applicable to the corpora used in our experiments, (2) corpora sampled from GitHub and Stack Overflow have different characteristics and require different configurations to achieve good model fit, and (3) we can predict good configurations for unseen corpora reliably. These findings support researchers and practitioners in efficiently determining suitable configurations for topic modelling when analysing textual data contained in software repositories.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"75 1","pages":"84-95"},"PeriodicalIF":0.0000,"publicationDate":"2018-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"38","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MSR.2019.00022","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 38

Abstract

Software repositories contain large amounts of textual data, ranging from source code comments and issue descriptions to questions, answers, and comments on Stack Overflow. To make sense of this textual data, topic modelling is frequently used as a text-mining tool for the discovery of hidden semantic structures in text bodies. Latent Dirichlet allocation (LDA) is a commonly used topic model that aims to explain the structure of a corpus by grouping texts. LDA requires multiple parameters to work well, and there are only rough and sometimes conflicting guidelines available on how these parameters should be set. In this paper, we contribute (i) a broad study of parameters to arrive at good local optima for GitHub and Stack Overflow text corpora, (ii) an a-posteriori characterisation of text corpora related to eight programming languages, and (iii) an analysis of corpus feature importance via per-corpus LDA configuration. We find that (1) popular rules of thumb for topic modelling parameter configuration are not applicable to the corpora used in our experiments, (2) corpora sampled from GitHub and Stack Overflow have different characteristics and require different configurations to achieve good model fit, and (3) we can predict good configurations for unseen corpora reliably. These findings support researchers and practitioners in efficiently determining suitable configurations for topic modelling when analysing textual data contained in software repositories.
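To make the parameter space the abstract refers to concrete: in a typical LDA implementation the main knobs are the number of topics and the Dirichlet priors over the document-topic distribution (alpha) and the topic-word distribution (eta, sometimes written beta). The sketch below uses gensim's LdaModel and topic coherence as a proxy for model fit; the library choice, the toy corpus, and the specific parameter values are illustrative assumptions, not the experimental setup used in the paper.

```python
# Minimal sketch (assumption: gensim as the LDA toolkit; the paper's own
# experiments may use a different implementation and search procedure).
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

# Toy stand-in for a GitHub/Stack Overflow corpus: token lists per document.
docs = [
    ["null", "pointer", "exception", "java", "stacktrace"],
    ["git", "merge", "conflict", "branch", "rebase"],
    ["python", "list", "comprehension", "loop", "performance"],
    ["java", "exception", "try", "catch", "finally"],
]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# The parameters the paper studies: the number of topics and the Dirichlet
# priors alpha (document-topic) and eta (topic-word).
lda = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=2,     # rules of thumb for this value often transfer poorly
    alpha="auto",     # learn an asymmetric document-topic prior from the data
    eta="auto",       # learn the topic-word prior as well
    passes=10,
    random_state=42,
)

# One common proxy for "good model fit": topic coherence (higher is better).
coherence = CoherenceModel(
    model=lda, texts=docs, dictionary=dictionary, coherence="c_v"
).get_coherence()
print(f"c_v coherence: {coherence:.3f}")
```

Finding a good per-corpus configuration would wrap the LdaModel call in a search over num_topics, alpha, and eta; the paper's point is that such good settings can be predicted for unseen corpora instead of being re-searched from scratch each time.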