Thu T. H. Doan, P. Nguyen, Juri Di Rocco, Davide Di Ruscio
{"title":"Too long; didn’t read: Automatic summarization of GitHub README.MD with Transformers","authors":"Thu T. H. Doan, P. Nguyen, Juri Di Rocco, Davide Di Ruscio","doi":"10.1145/3593434.3593448","DOIUrl":null,"url":null,"abstract":"The ability to allow developers to share their source code and collaborate on software projects has made GitHub a widely used open source platform. Each repository in GitHub is generally equipped with a README.MD file to exhibit an overview of the main functionalities. Nevertheless, while offering useful information, README.MD is usually lengthy, requiring time and effort to read and comprehend. Thus, besides README.MD, GitHub also allows its users to add a short description called “About,” giving a brief but informative summary about the repository. This enables visitors to quickly grasp the main content and decide whether to continue reading. Unfortunately, due to various reasons–not excluding laziness–oftentimes this field is left blank by developers. This paper proposes GitSum as a novel approach to the summarization of README.MD. GitSum is built on top of BART and T5, two cutting-edge deep learning techniques, learning from existing data to perform recommendations for repositories with a missing description. We test its performance using two datasets collected from GitHub. The evaluation shows that GitSum can generate relevant predictions, outperforming a well-established baseline.","PeriodicalId":178596,"journal":{"name":"Proceedings of the 27th International Conference on Evaluation and Assessment in Software Engineering","volume":"9 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 27th International Conference on Evaluation and Assessment in Software Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3593434.3593448","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
The ability to allow developers to share their source code and collaborate on software projects has made GitHub a widely used open source platform. Each repository in GitHub is generally equipped with a README.MD file to exhibit an overview of the main functionalities. Nevertheless, while offering useful information, README.MD is usually lengthy, requiring time and effort to read and comprehend. Thus, besides README.MD, GitHub also allows its users to add a short description called “About,” giving a brief but informative summary about the repository. This enables visitors to quickly grasp the main content and decide whether to continue reading. Unfortunately, due to various reasons–not excluding laziness–oftentimes this field is left blank by developers. This paper proposes GitSum as a novel approach to the summarization of README.MD. GitSum is built on top of BART and T5, two cutting-edge deep learning techniques, learning from existing data to perform recommendations for repositories with a missing description. We test its performance using two datasets collected from GitHub. The evaluation shows that GitSum can generate relevant predictions, outperforming a well-established baseline.