OSS License Identification at Scale: A Comprehensive Dataset Using World of Code
Mahmoud Jahanshahi, David Reid, Adam McDaniel, Audris Mockus
arXiv - CS - Software Engineering, 2024-09-07 (arXiv 2409.04824)
Abstract
The proliferation of open source software (OSS) has led to a complex
landscape of licensing practices, making accurate license identification
crucial for legal and compliance purposes. This study presents a comprehensive
analysis of OSS licenses using the World of Code (WoC) infrastructure. We
employ an exhaustive approach, scanning all files containing "license" in
their filepath, and apply the winnowing algorithm for robust text matching. Our
method identifies and matches over 5.5 million distinct license blobs across
millions of OSS projects, creating a detailed project-to-license (P2L) map. We
verify the accuracy of our approach through stratified sampling and manual
review, achieving a final accuracy of 92.08%, with precision of 87.14%, recall
of 95.45%, and an F1 score of 91.11%. This work enhances the understanding of
OSS licensing practices and provides a valuable resource for developers,
researchers, and legal professionals. Future work will expand the scope of
license detection to include code files and references to licenses in project
documentation.
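
The winnowing step mentioned above can be illustrated with a minimal Python sketch. The abstract does not state the parameters used, so the k-gram length `k`, the window size `w`, the normalization rule, the hash function, and the Jaccard comparison in `similarity` are all assumptions chosen for illustration, not the authors' implementation.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and drop every non-alphanumeric character (assumed rule)."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def kgram_hashes(text: str, k: int) -> list[int]:
    """Hash every k-character substring of the normalized text."""
    s = normalize(text)
    return [
        int.from_bytes(
            hashlib.blake2b(s[i:i + k].encode(), digest_size=8).digest(), "big"
        )
        for i in range(len(s) - k + 1)
    ]


def winnow(text: str, k: int = 5, w: int = 4) -> set[int]:
    """Fingerprint = minimum hash of each window of w consecutive k-gram hashes."""
    hashes = kgram_hashes(text, k)
    if not hashes:
        return set()
    if len(hashes) < w:
        # Text too short for a full window: keep the single smallest hash.
        return {min(hashes)}
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


def similarity(a: str, b: str, k: int = 5, w: int = 4) -> float:
    """Jaccard overlap of two fingerprint sets (illustrative comparison only)."""
    fa, fb = winnow(a, k, w), winnow(b, k, w)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)
```

Because fingerprints are computed on normalized text, two copies of a license that differ only in case, whitespace, or punctuation yield identical fingerprint sets. A license blob with small local edits will typically still share most of its fingerprints with the reference text.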