2009 6th IEEE International Working Conference on Mining Software Repositories: latest papers

Mining the history of synchronous changes to refine code ownership
Pub Date : 2009-05-16 DOI: 10.1109/MSR.2009.5069492
Lile Hattori, Michele Lanza
When software repositories are mined, two distinct sources of information are usually explored: the history log and snapshots of the system. Results of analyses derived from these two sources are biased by the frequency with which developers commit their changes. We argue that the usage of mainstream SCM systems influences the way that developers work. For example, since it is tedious to resolve conflicts due to parallel commits, developers tend to minimize conflicts by not modifying the same file at the same time. This, however, defeats one of the purposes of such systems. We mine repositories created by our Syde tool, which records every change by every developer in multi-developer projects. This new source of information can improve the accuracy of analyses and breaks new ground in terms of how such information can assist developers. In this paper we illustrate how the information we mine can help to provide a refined notion of code ownership. As a case study, we analyze the developers' activities during the development of a commercial system.
Citations: 45
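The bias described in the Hattori and Lanza abstract above can be illustrated with a small sketch. It assumes a hypothetical record format of `(developer, path)` pairs and a deliberately simple ownership rule (the developer with the most recorded events on a file owns it); neither is taken from Syde, whose data model the abstract does not describe. The sample data is invented purely to show how commit-level and change-level counts can disagree.

```python
from collections import Counter, defaultdict


def ownership(events):
    """Map each path to (owner, share) from (developer, path) events.

    Assumption: the owner is the developer with the most events on the
    file; ties are broken alphabetically so the result is deterministic.
    """
    per_file = defaultdict(Counter)
    for developer, path in events:
        per_file[path][developer] += 1

    result = {}
    for path, counts in per_file.items():
        owner, n = min(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        result[path] = (owner, n / sum(counts.values()))
    return result


# Invented example: ben commits more often, but ana makes more changes.
commits = [("ben", "Order.java"), ("ben", "Order.java"), ("ana", "Order.java")]
changes = [("ana", "Order.java")] * 6 + [("ben", "Order.java")] * 2

print(ownership(commits)["Order.java"][0])  # ben
print(ownership(changes)["Order.java"])     # ('ana', 0.75)
```

With commit counts alone, the developer who commits in small steps appears to own the file; counting every recorded change shifts ownership to the developer who did most of the editing.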
Code siblings: Technical and legal implications of copying code between applications
Pub Date : 2009-05-16 DOI: 10.1109/MSR.2009.5069483
D. Germán, M. D. Penta, Yann-Gaël Guéhéneuc, G. Antoniol
Source code cloning does not happen within a single system only. It can also occur between one system and another. We use the term code sibling to refer to a code clone that evolves in a different system than the code from which it originates. Code siblings can only occur when the source code copyright owner allows it and when the conditions imposed by such a license are not incompatible with the license of the destination system. In some situations, copying of source code fragments is legally allowed in one direction, but not in the other. In this paper, we use clone detection, license mining and classification, and change history techniques to understand how code siblings under different licenses flow in one direction or the other between Linux and two BSD Unixes, FreeBSD and OpenBSD. Our results show that, in most cases, this migration appears to happen according to the terms of the license of the original code being copied, always favoring copying from less restrictive licenses towards more restrictive ones. We also discovered that sometimes code is inserted into the kernels from an outside source.
Citations: 96
From work to word: How do software developers describe their work?
Pub Date : 2009-05-16 DOI: 10.1109/MSR.2009.5069490
W. Maalej, Hans-Jörg Happel
Developers take notes about their work sessions, either to remember the work status and share it with collaborators, or because employers explicitly require this for project management purposes. We report on an exploratory study which aims at understanding how software developers describe their work. We analyzed more than 750,000 work descriptions of about 2,000 professionals taken over 8 years in three settings. We observed several similarities in the content and time meta-data of work descriptions. The most frequent terms, such as the top 30 performed activities, are used consistently. Particular templates such as “ACTION concerning ARTIFACT because of CAUSE” occur frequently. Developers described sessions lasting 30–120 minutes, 4–16 times a day. Maintaining diaries seems to consume between 3% and 6% of the total work time, and in 10% of the sessions, developers did not describe their work in sufficient detail. We argue that our results make the first step towards automatically generating work diaries for software developers.
Citations: 42
Tracking concept drift of software projects using defect prediction quality
Pub Date : 2009-05-16 DOI: 10.1109/MSR.2009.5069480
J. Ekanayake, Jonas Tappolet, H. Gall, A. Bernstein
Defect prediction is an important task in the mining of software repositories, but the quality of predictions varies strongly within and across software projects. In this paper we investigate why prediction quality fluctuates so much, given the changing nature of the bug (or defect) fixing process. To do so, we adopt the notion of concept drift, which denotes that the defect prediction model has become unsuitable because the set of influencing features has changed, usually due to a change in the underlying bug generation process (i.e., the concept). We explore four open source projects (Eclipse, OpenOffice, Netbeans and Mozilla) and construct file-level and project-level features for each of them from their respective CVS and Bugzilla repositories. We then use this data to build defect prediction models and visualize the prediction quality along the time axis. These visualizations allow us to identify concept drifts and, as a consequence, phases of stability and instability expressed in the level of defect prediction quality. Further, we identify the project features that influence the defect prediction quality using both a tree-induction algorithm and a linear regression model. Our experiments reveal that software systems are subject to considerable concept drifts in their evolution history. Specifically, we observe that changes in the number of authors editing a file and in the number of defects fixed by them contribute to a project's concept drift and therefore influence the defect prediction quality. Our findings suggest that project managers using defect prediction models for decision making should be aware of the actual phase of stability or instability due to a potential concept drift.
Citations: 70
Evolution of the core team of developers in libre software projects
Pub Date : 2009-05-16 DOI: 10.1109/MSR.2009.5069497
G. Robles, Jesus M. Gonzalez-Barahona, I. Herraiz
In many libre (free, open source) software projects, most of the development is performed by a relatively small number of people, the “core team”. The stability and permanence of this group of most active developers are of great importance for the evolution and sustainability of the project. In this position paper we propose a quantitative methodology to study the evolution of core teams by analyzing information from source code management repositories. The most active developers in different periods are identified, and their activity is calculated over time, looking for core team evolution patterns.
Citations: 90
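The idea in the Robles et al. abstract above, identifying the most active developers in each period, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' methodology: activity is measured by commit count, periods are calendar years, and the "core team" is the top 20% of committers per period. The abstract specifies none of these choices, and the input format `(author, date)` is hypothetical.

```python
import math
from collections import Counter, defaultdict
from datetime import date


def core_teams(commits, fraction=0.2):
    """Return {year: [authors]} with the most active committers per year.

    Assumptions: `commits` is an iterable of (author, date) pairs, and the
    core team is the top `fraction` of committers (at least one person).
    """
    per_year = defaultdict(Counter)
    for author, day in commits:
        per_year[day.year][author] += 1

    teams = {}
    for year, counts in sorted(per_year.items()):
        size = max(1, math.ceil(len(counts) * fraction))
        # Rank by commit count (descending), then by name for determinism.
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        teams[year] = [author for author, _ in ranked[:size]]
    return teams


log = [
    ("alice", date(2007, 1, 10)), ("alice", date(2007, 3, 2)),
    ("alice", date(2007, 9, 5)), ("bob", date(2007, 4, 1)),
    ("carol", date(2007, 6, 20)),
    ("bob", date(2008, 1, 15)), ("bob", date(2008, 2, 11)),
    ("bob", date(2008, 5, 30)), ("bob", date(2008, 8, 8)),
    ("alice", date(2008, 3, 3)), ("dave", date(2008, 7, 7)),
]
print(core_teams(log))  # {2007: ['alice'], 2008: ['bob']}
```

Comparing the team lists of consecutive periods then shows whether the core team is stable or is being replaced over time.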
Automatic labeling of software components and their evolution using log-likelihood ratio of word frequencies in source code
Pub Date : 2009-05-16 DOI: 10.1109/MSR.2009.5069499
Adrian Kuhn
As more and more open-source software components become available on the internet we need automatic ways to label and compare them. For example, a developer who searches for reusable software must be able to quickly gain an understanding of retrieved components. This understanding cannot be gained at the level of source code due to the semantic gap between source code and the domain model. In this paper we present a lexical approach that uses the log-likelihood ratios of word frequencies to automatically provide labels for software components. We present a prototype implementation of our labeling/comparison algorithm and provide examples of its application. In particular, we apply the approach to detect trends in the evolution of a software system.
Citations: 29
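The log-likelihood ratio of word frequencies mentioned in the Kuhn abstract above can be sketched with the standard G² statistic over a 2×2 contingency table (word vs. all other words, component vs. reference corpus). The function names, the tokenized inputs, and the rule of keeping only words that are over-represented in the component are assumptions made for illustration; the abstract does not describe the prototype's actual implementation.

```python
import math
from collections import Counter


def log_likelihood_ratio(k1, n1, k2, n2):
    """G^2 for a word seen k1 times in n1 tokens vs. k2 times in n2 tokens."""
    p = (k1 + k2) / (n1 + n2)  # pooled relative frequency of the word
    observed = [k1, n1 - k1, k2, n2 - k2]
    expected = [n1 * p, n1 * (1 - p), n2 * p, n2 * (1 - p)]
    # Cells with an observed count of 0 contribute 0 by convention.
    return 2.0 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected) if o > 0)


def label_component(component_tokens, reference_tokens, top=3):
    """Return the `top` words most characteristic of the component."""
    comp, ref = Counter(component_tokens), Counter(reference_tokens)
    n1, n2 = sum(comp.values()), sum(ref.values())
    scored = []
    for word, k1 in comp.items():
        k2 = ref.get(word, 0)
        if k1 / n1 > k2 / n2:  # keep only over-represented words
            scored.append((log_likelihood_ratio(k1, n1, k2, n2), word))
    scored.sort(reverse=True)
    return [word for _, word in scored[:top]]


component = ["parser", "token", "parser", "get", "set"]
reference = ["get", "set", "get", "set", "widget", "get", "set", "paint"]
print(label_component(component, reference, top=2))  # ['parser', 'token']
```

Frequent but generic words such as `get` and `set` are filtered out because they are no more common in the component than in the reference corpus, which is the property that makes the statistic useful for labeling.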
Evaluating process quality in GNOME based on change request data
Pub Date : 2009-05-16 DOI: 10.1109/MSR.2009.5069485
Holger Schackmann, H. Lichter
The lifecycle of defect reports and enhancement requests collected in the Bugzilla database of the GNOME project provides valuable information on the evolution of the change request process and for the assessment of process quality in the GNOME subprojects. We present a quality model for the analysis of quality characteristics that is based on evaluating metrics on the Bugzilla database, and illustrate it with a comparative evaluation for 25 of the largest products within GNOME.
Citations: 13
Mining source code to automatically split identifiers for software analysis
Pub Date : 2009-05-16 DOI: 10.1109/MSR.2009.5069482
Eric Enslen, Emily Hill, L. Pollock, K. Vijay-Shanker
Automated software engineering tools (e.g., program search, concern location, code reuse, and quality assessment) increasingly rely on natural language information from comments and identifiers in code. The first step in analyzing words from identifiers requires splitting identifiers into their constituent words. Unlike natural languages, where space and punctuation are used to delineate words, identifiers cannot contain spaces. One common way to split identifiers is to follow programming language naming conventions. For example, Java programmers often use camel case, where words are delineated by uppercase letters or non-alphabetic characters. However, programmers also create identifiers by concatenating sequences of words together with no discernible delineation, which poses challenges to automatic identifier splitting. In this paper, we present an algorithm to automatically split identifiers into sequences of words by mining word frequencies in source code. With these word frequencies, our identifier splitter uses a scoring technique to automatically select the most appropriate partitioning for an identifier. In an evaluation of over 8000 identifiers from open source Java programs, our Samurai approach outperforms the existing state-of-the-art techniques.
Citations: 193
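The two splitting steps described in the Enslen et al. abstract above can be sketched as follows: first split on naming conventions (camel case, underscores, digits), then split any remaining same-case run using word frequencies. The abstract does not specify Samurai's scoring function, so this sketch swaps in a plain unigram-probability segmentation instead; the function names and the tiny frequency table are hypothetical.

```python
import math
import re
from functools import lru_cache


def split_on_conventions(identifier):
    """Split at underscores, digits, and camel-case boundaries."""
    words = []
    for part in re.split(r"[_\W\d]+", identifier):
        # An uppercase run not followed by lowercase (e.g. "HTTP"),
        # or an optionally capitalized lowercase word (e.g. "Response").
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", part))
    return [w.lower() for w in words]


def split_same_case(token, freq):
    """Split a same-case token into known words, maximizing the product
    of unigram probabilities; return the token unchanged if it cannot be
    fully covered by words in `freq`."""
    total = sum(freq.values())

    @lru_cache(maxsize=None)
    def best(s):
        if not s:
            return 0.0, ()
        candidates = []
        for i in range(1, len(s) + 1):
            head = s[:i]
            if head in freq:
                score, rest = best(s[i:])
                candidates.append(
                    (math.log(freq[head] / total) + score, (head,) + rest))
        return max(candidates) if candidates else (float("-inf"), (s,))

    score, parts = best(token)
    return list(parts) if score > float("-inf") else [token]


def split_identifier(identifier, freq):
    words = []
    for chunk in split_on_conventions(identifier):
        words.extend(split_same_case(chunk, freq))
    return words


# Hypothetical word frequencies, as if mined from a source code corpus.
freq = {"get": 60, "file": 50, "name": 40, "list": 30}
print(split_identifier("getFilenameList", freq))
# ['get', 'file', 'name', 'list']
```

Camel case alone would leave `filename` as a single word; the frequency table is what allows it to be split into `file` and `name`. Because each extra part multiplies in another probability below 1, the scoring favors fewer, more frequent words over aggressive splitting.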
Mining the coherence of GNOME bug reports with statistical topic models
Pub Date : 2009-05-16 DOI: 10.1109/MSR.2009.5069486
Erik J. Linstead, P. Baldi
We adapt Latent Dirichlet Allocation to the problem of mining bug reports in order to define a new information-theoretic measure of coherence. We then apply our technique to a snapshot of the GNOME Bugzilla database consisting of 431,863 bug reports for multiple software projects. In addition to providing an unsupervised means for modeling report content, our results indicate substantial promise in applying statistical text mining algorithms for estimating bug report quality. Complete results are available from our supplementary materials website at http://sourcerer.ics.uci.edu/msr2009/gnome_coherence.html.
Citations: 40
SourcererDB: An aggregated repository of statically analyzed and cross-linked open source Java projects
Pub Date : 2009-05-16 DOI: 10.1109/MSR.2009.5069501
Joel Ossher, S. Bajracharya, Erik J. Linstead, P. Baldi, C. Lopes
The open source movement has made vast quantities of source code available online for free, providing an extremely large dataset for empirical study and potential reuse. A major difficulty in fully exploiting this potential is that the data are currently scattered between competing source code repositories, none of which are structured for empirical analysis and cross-project comparison. As a result, software researchers and developers are left to compile their own datasets, resulting in duplicated effort and limited results. To address this challenge, we built SourcererDB, an aggregated repository of statically analyzed and cross-linked open source Java projects. SourcererDB contains local snapshots of 2,852 Java projects taken from Sourceforge, Apache and Java.net. These projects are statically analyzed to extract rich structural information, which is then stored in a relational database. References to entities in the 16,058 external jars are resolved and grouped, allowing for cross-project usage information to be accessed easily. This paper describes: (a) the mechanism for resolving and grouping these cross-project references, (b) the structure of and the metamodel for the SourcererDB repository, and (c) end-user dataset access mechanisms. Our goal in building SourcererDB is to provide a rich dataset of source code to facilitate the sharing of extracted data and to encourage reuse and repeatability of experiments.
开源运动使得大量的源代码可以在网上免费获得,为实证研究和潜在的再利用提供了一个极其庞大的数据集。充分利用这一潜力的一个主要困难是,数据目前分散在相互竞争的源代码存储库之间,没有一个是为经验分析和跨项目比较而构建的。因此,软件研究人员和开发人员只能自己编写数据集,这导致了重复的工作和有限的结果。为了应对这一挑战,我们构建了SourcererDB,这是一个静态分析和交叉链接的开源Java项目的聚合存储库。SourcererDB包含来自Sourceforge、Apache和Java.net的2852个Java项目的本地快照。对这些项目进行静态分析,以提取丰富的结构信息,然后将其存储在关系数据库中。对16,058个外部jar中的实体的引用进行了解析和分组,从而可以轻松地访问跨项目使用信息。本文描述了:(a)解析和分组这些跨项目引用的机制,(b) SourcererDB存储库的结构和元模型,以及(d)最终用户数据集访问机制。我们构建SourcererDB的目标是提供丰富的源代码数据集,以促进提取数据的共享,并鼓励实验的重用和可重复性。
Citations: 47
Journal
2009 6th IEEE International Working Conference on Mining Software Repositories