{"title":"On Mining Crowd-Based Speech Documentation","authors":"P. Moslehi, Bram Adams, J. Rilling","doi":"10.1145/2901739.2901771","DOIUrl":null,"url":null,"abstract":"Despite the globalization of software development, relevant documentation of a project, such as requirements and design documents, often still is missing, incomplete or outdated. However, parts of that documentation can be found outside the project, where it is fragmented across hundreds of textual web documents like blog posts, email messages and forum posts, as well as multimedia documents such as screencasts and podcasts. Since dissecting and filtering multimedia information based on its relevancy to a given project is an inherently difficult task, it is necessary to provide an automated approach for mining this crowd-based documentation. In this paper, we are interested in mining the speech part of YouTube screencasts, since this part typically contains the rationale and insights of a screencast. We introduce a methodology that transcribes and analyzes the transcribed text using various Information Extraction (IE) techniques, and present a case study to illustrate the applicability of our mining methodology. In this case study, we extract use case scenarios from WordPress tutorial videos and show how their content can supplement existing documentation. We then evaluate how well existing rankings of video content are able to pinpoint the most relevant videos for a given scenario.","PeriodicalId":6621,"journal":{"name":"2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)","volume":"6 1","pages":"259-268"},"PeriodicalIF":0.0000,"publicationDate":"2016-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2901739.2901771","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 10
Abstract
Despite the globalization of software development, relevant documentation of a project, such as requirements and design documents, often still is missing, incomplete or outdated. However, parts of that documentation can be found outside the project, where it is fragmented across hundreds of textual web documents like blog posts, email messages and forum posts, as well as multimedia documents such as screencasts and podcasts. Since dissecting and filtering multimedia information based on its relevancy to a given project is an inherently difficult task, it is necessary to provide an automated approach for mining this crowd-based documentation. In this paper, we are interested in mining the speech part of YouTube screencasts, since this part typically contains the rationale and insights of a screencast. We introduce a methodology that transcribes and analyzes the transcribed text using various Information Extraction (IE) techniques, and present a case study to illustrate the applicability of our mining methodology. In this case study, we extract use case scenarios from WordPress tutorial videos and show how their content can supplement existing documentation. We then evaluate how well existing rankings of video content are able to pinpoint the most relevant videos for a given scenario.