Characterizing Deep Learning Package Supply Chains in PyPI: Domains, Clusters, and Disengagement

ACM Transactions on Software Engineering and Methodology | Pub Date: 2024-01-10 | DOI: 10.1145/3640336 | IF 6.6 | Zone 2 (Computer Science) | Q1 (Computer Science, Software Engineering)

Kai Gao, Runzhi He, Bing Xie, Minghui Zhou



Abstract

Deep learning (DL) frameworks have become the cornerstone of the rapidly developing DL field. Through installation dependencies specified in the distribution metadata, numerous packages directly or transitively depend on DL frameworks, layer after layer, forming DL package supply chains (SCs), which are critical for DL frameworks to remain competitive. However, vital knowledge on how to nurture and sustain DL package SCs is still lacking. Such knowledge may help DL frameworks formulate effective measures to strengthen their SCs and remain competitive, and may shed light on dependency issues and practices in the DL SC for researchers and practitioners. In this paper, we explore the domains, clusters, and disengagement of packages in two representative PyPI DL package SCs to bridge this knowledge gap. We analyze the metadata of nearly six million PyPI package distributions and construct version-sensitive SCs for two popular DL frameworks: TensorFlow and PyTorch. We find that popular packages (measured by the number of monthly downloads) in the two SCs cover 34 domains belonging to eight categories. The Applications, Infrastructure, and Sciences categories account for over 85% of popular packages in either SC, and the TensorFlow and PyTorch SCs have developed specializations in Infrastructure and Applications packages, respectively. We employ the Leiden community detection algorithm and detect 131 and 100 clusters in the two SCs. The clusters mainly exhibit four shapes, in order of increasing dependency complexity: Arrow, Star, Tree, and Forest. Most clusters are Arrow or Star, while Tree and Forest clusters account for most packages (TensorFlow SC: 70.7%, PyTorch SC: 92.9%). We identify three groups of reasons why packages disengage from the SC (i.e., remove the DL framework and its dependents from their installation dependencies): dependency issues, functional improvements, and ease of installation. The most common reason in the TensorFlow SC is dependency incompatibility, and in the PyTorch SC it is to simplify functionalities and reduce installation size. Our study provides rich implications for DL framework vendors, researchers, and practitioners on the maintenance and dependency management practices of PyPI DL SCs.
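To make the notions of "layer after layer" dependence and disengagement concrete, the following is a minimal sketch and not the authors' pipeline. It assumes a hypothetical toy table `TOY_METADATA` that maps each (package, version) release to the names of its direct installation dependencies. Version specifiers are ignored and versions are assumed to be dotted integers, both simplifications relative to real PyPI metadata. The function names `supply_chain_layers` and `disengaged_packages` are illustrative only.

```python
from collections import defaultdict, deque

# Hypothetical toy metadata: (package, version) -> direct install dependencies.
# Real distribution metadata carries version specifiers; they are omitted here.
TOY_METADATA = {
    ("torch", "2.0"): [],
    ("vision-lib", "1.0"): ["torch"],
    ("app", "1.0"): ["vision-lib"],
    ("app", "2.0"): [],  # the newer release no longer reaches the framework
    ("unrelated", "1.0"): ["requests"],
}


def supply_chain_layers(metadata, framework):
    """Map each release that depends on `framework` to its layer.

    Layer 1 holds direct dependents; layer n reaches the framework in n hops.
    """
    # Reverse the edges: dependency name -> releases that require it.
    dependents = defaultdict(set)
    for release, deps in metadata.items():
        for dep in deps:
            dependents[dep].add(release)

    layers = {}
    seen_names = {framework}
    queue = deque([(framework, 0)])
    while queue:  # breadth-first search outward from the framework
        name, depth = queue.popleft()
        for release in sorted(dependents[name]):
            layers.setdefault(release, depth + 1)
            pkg_name = release[0]
            if pkg_name not in seen_names:
                seen_names.add(pkg_name)
                queue.append((pkg_name, depth + 1))
    return layers


def disengaged_packages(metadata, layers):
    """Return packages whose latest release left the SC after an earlier one was in it."""
    releases = defaultdict(list)
    for name, version in metadata:
        releases[name].append(version)

    result = []
    for name, versions in releases.items():
        # Assumption: versions are dotted integers such as "1.0".
        versions.sort(key=lambda v: tuple(int(p) for p in v.split(".")))
        was_in = any((name, v) in layers for v in versions[:-1])
        latest_in = (name, versions[-1]) in layers
        if was_in and not latest_in:
            result.append(name)
    return sorted(result)


layers = supply_chain_layers(TOY_METADATA, "torch")
print(layers)
print(disengaged_packages(TOY_METADATA, layers))
```

In this toy data, `vision-lib` sits in layer 1 and `app` 1.0 in layer 2, while `app` counts as disengaged because its 2.0 release no longer reaches `torch`. Cluster detection with the Leiden algorithm is not shown, since it needs a third-party graph library.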




Source journal: ACM Transactions on Software Engineering and Methodology (Engineering and Technology: Computer Science, Software Engineering)

CiteScore: 6.30
Self-citation rate: 4.50%
Articles published: 164
Review time: >12 weeks

Journal description: Designing and building a large, complex software system is a tremendous challenge. ACM Transactions on Software Engineering and Methodology (TOSEM) publishes papers on all aspects of that challenge: specification, design, development and maintenance. It covers tools and methodologies, languages, data structures, and algorithms. TOSEM also reports on successful efforts, noting practical lessons that can be scaled and transferred to other projects, and often looks at applications of innovative technologies. The tone is scholarly but readable; the content is worthy of study; the presentation is effective.