
Latest publications: Journal of Parallel and Distributed Computing

The (t,k)-diagnosability of Cayley graph generated by 2-tree
IF 3.4 | CAS Tier 3 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-06-01 | Epub Date: 2025-03-21 | DOI: 10.1016/j.jpdc.2025.105068 | Volume 200, Article 105068
Lulu Yang , Shuming Zhou , Eddie Cheng
Multiprocessor systems, which typically use interconnection networks (or graphs) as underlying topologies, are widely utilized for big data analysis in scientific computing, owing to advances in technologies such as cloud computing, the IoT, and social networks. With the dramatic expansion in the scale of multiprocessor systems, the pursuit and optimization of strategies for identifying faulty processors have become crucial to ensuring the normal operation of high-performance computing systems. System-level diagnosis is the process of distinguishing faulty processors from fault-free processors in a multiprocessor system. The (t,k)-diagnosis, a generalization of sequential diagnosis, identifies and repairs at least k faulty processors in each iteration under the assumption that there are at most t faulty processors, where t ≥ k. We show that the Cayley graph generated by a 2-tree is (2^(n−3), 2n−4)-diagnosable under the PMC model for n ≥ 5, while it is (2^(n−3)(2n−6)/(2n−4), 2n−4)-diagnosable under the MM* model for n ≥ 4. As an empirical case study, the (t,k)-diagnosabilities of the alternating group graph AGn under the PMC model and the MM* model are also determined.
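Written out explicitly, the two bounds above are (reconstructed from the abstract's formulas; the abstract refers to the graph only as "the Cayley graph generated by a 2-tree"):

```latex
% (t,k)-diagnosability of the Cayley graph generated by a 2-tree
% PMC model, for n >= 5:
\[
\bigl(2^{\,n-3},\; 2n-4\bigr)\text{-diagnosable}
\]
% MM* model, for n >= 4:
\[
\Bigl(\tfrac{2^{\,n-3}(2n-6)}{2n-4},\; 2n-4\Bigr)\text{-diagnosable}
\]
```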
Citations: 0
IMI-GPU: Inverted multi-index for billion-scale approximate nearest neighbor search with GPUs
IF 3.4 | CAS Tier 3 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-06-01 | Epub Date: 2025-03-04 | DOI: 10.1016/j.jpdc.2025.105066 | Volume 200, Article 105066
Alan Araujo , Willian Barreiros Jr. , Jun Kong , Renato Ferreira , George Teodoro
Similarity search is used in specialized database systems designed to handle multimedia data, which are often represented by high-dimensional features. In this paper, we focus on speeding up the search process with GPUs. This problem has previously been approached by accelerating the Inverted File with Asymmetric Distance Computation algorithm on GPUs (IVFADC-GPU). However, the most recent CPU algorithm, the Inverted Multi-Index (IMI), had not been parallelized, as it was considered too challenging for efficient GPU deployment. We therefore propose a novel and efficient GPU version of IMI, called IMI-GPU, built on a new design of IMI's multi-sequence algorithm that enables efficient GPU execution. We compared IMI-GPU with IVFADC-GPU on a billion-scale dataset, where IMI-GPU achieved speedups of about 3.2× and 1.9× at Recall@1 and Recall@16, respectively. The algorithms were compared in a variety of scenarios, and our IMI-GPU significantly outperforms IVFADC on GPUs in the majority of tested cases.
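The multi-sequence algorithm at the core of IMI enumerates cells of the product codebook in nondecreasing order of summed sub-distances. The following is a minimal CPU-side sketch of that traversal only; the paper's contribution, a GPU-friendly redesign of this step, is not reproduced here.

```python
import heapq

def multi_sequence(d1, d2, top):
    """Yield up to `top` cell indices (i, j) in nondecreasing order of
    d1[i] + d2[j], where d1 and d2 are the query's distances to each
    sub-codebook centroid, each sorted ascending. Standard frontier
    expansion with a min-heap, as in the multi-sequence algorithm."""
    heap = [(d1[0] + d2[0], 0, 0)]
    seen = {(0, 0)}
    out = []
    while heap and len(out) < top:
        dist, i, j = heapq.heappop(heap)
        out.append((i, j, dist))
        # Push the two successor cells; sortedness of d1/d2 guarantees
        # the global order is preserved.
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(d1) and nj < len(d2) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (d1[ni] + d2[nj], ni, nj))
    return out
```

In the full index, each visited cell (i, j) maps to one inverted list whose entries are then re-ranked by asymmetric distance computation.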
Citations: 0
Exploring data science workflows: A practice-oriented approach to teaching processing of massive datasets
IF 3.4 | CAS Tier 3 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-06-01 | Epub Date: 2025-02-12 | DOI: 10.1016/j.jpdc.2025.105043 | Volume 200, Article 105043
Johannes Schoder , H. Martin Bücker
Massive datasets are typically processed by a sequence of different stages, comprising data acquisition and preparation, data processing, data analysis, result validation, and visualization. In conjunction, these stages form a data science workflow, a key element enabling the solution of data-intensive problems. The complexity and heterogeneity of these stages require a diverse set of techniques and skills. This article discusses a hands-on practice-oriented approach aiming to enable and motivate graduate students to engage with realistic data science workflows. A major goal of the approach is to bridge the gap between academia and industry by integrating programming assignments that implement different data workflows with real-world data. In consecutive assignments, students are exposed to the methodology of solving problems using big data frameworks and are required to implement different data workflows of varying complexity. This practice-oriented approach is well received by students, as confirmed by different surveys.
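The staged workflow described above can be mirrored directly in code by composing one function per stage. The sketch below uses toy data and illustrative stage names, none of which come from the article:

```python
def acquire():
    """Data acquisition: raw, messy input (toy data for illustration)."""
    return [" 3 ", "5", "bad", "8 "]

def prepare(raw):
    """Data preparation: strip whitespace, drop non-numeric records."""
    return [int(x) for x in (s.strip() for s in raw) if x.isdigit()]

def analyze(values):
    """Data analysis: compute simple summary statistics."""
    return {"n": len(values), "mean": sum(values) / len(values)}

def validate(result):
    """Result validation: sanity-check before reporting/visualization."""
    assert result["n"] > 0
    return result

# Chaining the stages mirrors the workflow structure described above.
report = validate(analyze(prepare(acquire())))
```

In a real course assignment each stage would be backed by a big data framework rather than plain lists, but the composition pattern is the same.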
Citations: 0
Front Matter 1 - Full Title Page (regular issues)/Special Issue Title page (special issues)
IF 3.4 | CAS Tier 3 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-06-01 | Epub Date: 2025-04-06 | DOI: 10.1016/S0743-7315(25)00041-3 | Volume 200, Article 105074
Citations: 0
Distributed landmark labeling for social networks
IF 3.4 | CAS Tier 3 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-06-01 | Epub Date: 2025-02-13 | DOI: 10.1016/j.jpdc.2025.105057 | Volume 200, Article 105057
Arda Şener, Hüsnü Yenigün, Kamer Kaya
Distance queries are a fundamental part of many network analysis applications. They can be used to infer the closeness of two users in a social network, the relation between two sites in a web graph, or the importance of the interaction between two proteins or molecules. Being able to answer these queries rapidly has many benefits in the area of network analysis. Pruned Landmark Labeling (Pll) is a technique for generating an index for a given graph that allows shortest-path queries to be answered in a fraction of the time required by a standard breadth-first or depth-first search-based algorithm. Parallel Shortest-distance Labeling (Psl) reorganizes the steps of Pll for the multithreaded setting and is designed particularly for social networks, whose index sizes can be much larger than a single server can store: even for a medium-size graph with 5 million vertices, the index can exceed 40 GB. This paper proposes a hybrid shared- and distributed-memory algorithm, DPSL, which partitions the input graph via a vertex separator. The proposed method improves both parallel execution time and maximum memory consumption by distributing the data and the work across multiple nodes of a cluster. For instance, on a graph with 5M vertices and 150M edges, using 4 nodes, DPSL reduces the execution time and maximum memory consumption by 2.13× and 1.87×, respectively, compared to our improved implementation of Psl.
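The query side of landmark (2-hop) labeling is compact: the distance between u and v is the minimum of d(u, w) + d(w, v) over landmarks w appearing in both labels. A minimal sketch of that query, with index construction (the part Pll, Psl, and DPSL actually optimize) omitted:

```python
def query(labels_u, labels_v):
    """Shortest-distance query over 2-hop landmark labels.

    labels_u / labels_v map landmark -> distance from u / v to that
    landmark. Returns infinity when the labels share no landmark
    (disconnected, for a correctly built index)."""
    best = float("inf")
    for w, du in labels_u.items():
        dv = labels_v.get(w)
        if dv is not None and du + dv < best:
            best = du + dv
    return best
```

Because each query only scans two label sets, answering it is far cheaper than a BFS over the graph; the cost moves into building and storing the labels, which is what motivates distributing the index.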
Citations: 0
Data quality management in big data: Strategies, tools, and educational implications
IF 3.4 | CAS Tier 3 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-06-01 | Epub Date: 2025-03-13 | DOI: 10.1016/j.jpdc.2025.105067 | Volume 200, Article 105067
Thu Nguyen , Hong-Tri Nguyen , Tu-Anh Nguyen-Hoang
This study addresses the critical need for effective Big Data Quality Management (BDQM) in education, a field where data quality has profound implications but remains underexplored. The work systematically progresses from requirement analysis and standard development to the deployment of tools for monitoring and enhancing data quality in big data workflows. The study's contributions are substantiated through five research questions that explore the impact of data quality on analytics, the establishment of evaluation standards, centralized management strategies, improvement techniques, and education-specific BDQM adaptations. By addressing these questions, the research advances both theoretical and practical frameworks, equipping stakeholders with the tools to enhance the reliability and efficiency of data-driven educational initiatives. Integrating Artificial Intelligence (AI) and distributed computing, this research introduces a novel multi-stage BDQM framework that emphasizes data quality assessment, centralized governance, and AI-enhanced improvement techniques. This work underscores the transformative potential of robust BDQM systems in supporting informed decision-making and achieving sustainable outcomes in educational projects. The survey findings highlight the potential for automated data management within big data architectures, suggesting that data quality frameworks can be significantly enhanced by leveraging AI and distributed computing. Additionally, the survey emphasizes emerging trends in big data quality management, specifically (i) automated data cleaning and cleansing and (ii) data enrichment and augmentation.
Citations: 0
Latency-aware placement of stream processing operators in modern-day stream processing frameworks
IF 3.4 | CAS Tier 3 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-05-01 | Epub Date: 2025-01-27 | DOI: 10.1016/j.jpdc.2025.105041 | Volume 199, Article 105041
Raphael Ecker , Vasileios Karagiannis , Michael Sober , Stefan Schulte
The rise of the Internet of Things has substantially increased the number of interconnected devices at the edge of the network. As a result, a large number of computations are now distributed across the compute continuum, spanning from the edge to the cloud and generating vast amounts of data. Stream processing is typically employed to process this data in near real time, owing to its efficiency in handling continuous streams of information in a scalable manner. However, many stream processing approaches do not consider the underlying network devices of the compute continuum as candidate resources for processing data. Moreover, many existing works do not account for the network latency incurred by performing computations on multiple devices in a distributed way. To avoid this, we formulate an optimization problem that utilizes the full resources of the compute continuum, and we design heuristics to solve this problem efficiently. Furthermore, we integrate our heuristics into Apache Storm and perform experiments that show latency- and throughput-related benefits compared to alternatives.
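As a rough illustration of what a latency-aware placement heuristic looks like, the following greedy sketch assigns each operator of a linear pipeline to the node minimizing link latency from the upstream operator's node plus the node's own processing latency. This is an assumed toy heuristic for illustration, not the authors' algorithm or Apache Storm's scheduler API:

```python
def place(operators, nodes, latency):
    """Greedy latency-aware placement of a linear operator pipeline.

    operators: pipeline in topological order.
    nodes:     dict node -> per-tuple processing latency on that node.
    latency:   latency[a][b] = network latency between nodes a and b
               (0 for on-node communication).
    """
    placement, prev = {}, None
    for op in operators:
        def cost(n):
            # Link cost from the previously placed operator, if any,
            # plus the candidate node's processing latency.
            link = 0 if prev is None else latency[prev][n]
            return link + nodes[n]
        best = min(nodes, key=cost)
        placement[op] = best
        prev = best
    return placement
```

With a high edge-to-cloud link latency, the heuristic keeps consecutive operators co-located, which is exactly the behavior a latency-aware placement should exhibit.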
Citations: 0
Front Matter 1 - Full Title Page (regular issues)/Special Issue Title page (special issues)
IF 3.4 | CAS Tier 3 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-05-01 | Epub Date: 2025-03-01 | DOI: 10.1016/S0743-7315(25)00027-9 | Volume 199, Article 105060
Citations: 0
GPU memory usage optimization for backward propagation in deep network training
IF 3.4 | CAS Tier 3 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-05-01 | Epub Date: 2025-02-11 | DOI: 10.1016/j.jpdc.2025.105053 | Volume 199, Article 105053
Ding-Yong Hong , Tzu-Hsien Tsai , Ning Wang , Pangfeng Liu , Jan-Jan Wu
In modern deep learning, it has been a trend to design larger Deep Neural Networks (DNNs) to execute more complex tasks with better accuracy. Meanwhile, Convolutional Neural Networks (CNNs) have become the standard method for most computer vision tasks. However, the memory allocated to intermediate data in convolution layers can cause severe memory pressure during model training. Many solutions have been proposed to resolve the problem. Besides hardware-dependent solutions, a general methodology, rematerialization, can reduce GPU memory usage by trading computation for memory efficiently. The idea is to select a set of intermediate results during the forward phase as checkpoints and to keep only those in memory. The backward phase recomputes the remaining intermediate data from the closest checkpoints in memory as needed. This recomputation increases execution time but saves memory by not storing all intermediate results during the forward phase. In this paper, we focus on efficiently finding the optimal checkpoint subset that achieves the least peak memory usage during model training. We first describe the theoretical background of neural network training using mathematical equations, and we use these equations to identify all essential data required during both the forward and backward phases to compute the gradients of the model's weights. We then formalize the checkpoint selection problem and propose a dynamic programming algorithm with time complexity O(n³) for finding the optimal checkpoint subset. Based on extensive experiments, we formulate a more accurate description of the problem using our theoretical analysis, revise the objective function based on tracing, and propose an O(n)-time algorithm for finding the optimal checkpoint subset.
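To make the checkpoint selection problem concrete, the following brute-force sketch evaluates every checkpoint subset under a deliberately simplified cost model (all checkpointed activations stay resident, plus the largest activation sum of any segment recomputed between consecutive checkpoints). The cost model and exhaustive search are assumptions for illustration; the paper's O(n³) dynamic program and O(n)-time algorithm solve the problem without enumeration and under their own, more accurate objective:

```python
from itertools import combinations

def peak_memory(sizes, checkpoints):
    """Peak memory for a checkpoint set under the simplified model:
    resident checkpoints + the largest segment of activations that must
    be recomputed at once between consecutive checkpoints."""
    resident = sum(sizes[i] for i in checkpoints)
    bounds = [-1] + sorted(checkpoints) + [len(sizes)]
    segment_peak = max(sum(sizes[a + 1:b]) for a, b in zip(bounds, bounds[1:]))
    return resident + segment_peak

def best_checkpoints(sizes):
    """Exhaustive search over checkpoint subsets; fine for small n,
    exponential in general, hence the need for the DP/linear algorithms."""
    layers = range(len(sizes))
    return min(
        (set(c) for r in layers for c in combinations(layers, r)),
        key=lambda c: peak_memory(sizes, c),
    )
```

Even this toy model shows the trade-off: no checkpoints means recomputing (and holding) everything at once, while checkpointing everything means nothing is freed.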
{"title":"GPU memory usage optimization for backward propagation in deep network training","authors":"Ding-Yong Hong ,&nbsp;Tzu-Hsien Tsai ,&nbsp;Ning Wang ,&nbsp;Pangfeng Liu ,&nbsp;Jan-Jan Wu","doi":"10.1016/j.jpdc.2025.105053","DOIUrl":"10.1016/j.jpdc.2025.105053","url":null,"abstract":"<div><div>In modern Deep Learning, it has been a trend to design larger Deep Neural Networks (DNNs) for the execution of more complex tasks and better accuracy. On the other hand, Convolutional Neural Networks (CNNs) have become the standard method for most of computer vision tasks. However, the memory allocation for the intermediate data in convolution layers can cause severe memory pressure during model training. Many solutions have been proposed to resolve the problem. Besides hardware-dependent solutions, a general methodology <em>rematerialization</em> can reduce GPU memory usage by trading computation for memory efficiently. The idea is to select a set of intermediate results during the forward phase as <em>checkpoints</em>, and only save them in memory to reduce memory usage. The backward phase recomputes the intermediate data from the closest checkpoints in memory as needed. This recomputation increases execution time but saves memory by not storing all intermediate results in memory during the forward phase. In this paper, we will focus on efficiently finding the optimal checkpoint subset to achieve the least peak memory usage during the model training. We first describe the theoretical background of the training of a neural network using mathematical equations. We use these equations to identify all essential data required during both forward and backward phases to compute the gradient of weights of the model. 
We then identify the <em>checkpoint selection</em> problem and propose a dynamic programming algorithm with time complexity <span><math><mi>O</mi><mo>(</mo><msup><mrow><mi>n</mi></mrow><mrow><mn>3</mn></mrow></msup><mo>)</mo></math></span> to solve the problem of finding the optimal checkpoint subset. With extensive experiments, we formulate a more accurate description of the problem using our theoretical analysis and revise the objective function based on tracing, and propose an <span><math><mi>O</mi><mo>(</mo><mi>n</mi><mo>)</mo></math></span>-time algorithm for finding the optimal checkpoint subset.</div></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"199 ","pages":"Article 105053"},"PeriodicalIF":3.4,"publicationDate":"2025-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143420196","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
DRViT: A dynamic redundancy-aware vision transformer accelerator via algorithm and architecture co-design on FPGA
IF 3.4 CAS Tier 3 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2025-05-01 Epub Date : 2025-01-28 DOI: 10.1016/j.jpdc.2025.105042
Xiangfeng Sun , Yuanting Zhang , Qinyu Wang , Xiaofeng Zou , Yujia Liu , Ziqian Zeng , Huiping Zhuang
Multi-modal artificial intelligence (MAI) has attracted significant interest due to its capability to process and integrate data from multiple modalities, including images, text, and audio. Addressing MAI tasks in distributed systems necessitates robust and efficient architectures. The Transformer architecture has emerged as a primary network in this context. The integration of Vision Transformers (ViTs) within multimodal frameworks is crucial for enhancing the processing and comprehension of image data across diverse modalities. However, the complex architecture of ViTs and the extensive resources required for processing large-scale image data pose high computational and storage demands. These demands are particularly challenging for deploying ViTs on edge devices within distributed frameworks. To address this issue, we propose a novel dynamic redundancy-aware ViT accelerator based on parallel computing, termed DRViT. DRViT is supported by an algorithm and architecture co-design. We first propose a hardware-friendly lightweight algorithm featuring token merging, token pruning, and an INT8 quantization scheme. Then, we design a specialized architecture to support this algorithm, transforming the lightweight algorithm into significant latency and energy-efficiency improvements. Our design is implemented on the Xilinx Alveo U250, achieving an overall inference latency of 0.86 ms and 1.17 ms per image for ViT-tiny at 140 MHz and 100 MHz, respectively. The throughput can reach 1,380 GOP/s at peak, demonstrating superior performance compared to state-of-the-art accelerators, even at lower frequencies.
{"title":"DRViT: A dynamic redundancy-aware vision transformer accelerator via algorithm and architecture co-design on FPGA","authors":"Xiangfeng Sun ,&nbsp;Yuanting Zhang ,&nbsp;Qinyu Wang ,&nbsp;Xiaofeng Zou ,&nbsp;Yujia Liu ,&nbsp;Ziqian Zeng ,&nbsp;Huiping Zhuang","doi":"10.1016/j.jpdc.2025.105042","DOIUrl":"10.1016/j.jpdc.2025.105042","url":null,"abstract":"<div><div>The multi-modal artificial intelligence (MAI) has attracted significant interest due to its capability to process and integrate data from multiple modalities, including images, text, and audio. Addressing MAI tasks in distributed systems necessitate robust and efficient architectures. The Transformer architecture has emerged as a primary network in this context. The integration of Vision Transformers (ViTs) within multimodal frameworks is crucial for enhancing the processing and comprehension of image data across diverse modalities. However, the complex architecture of ViTs and the extensive resources required for processing large-scale image data pose high computational and storage demands. These demands are particularly challenging for deploying ViTs on edge devices within distributed frameworks. To address this issue, we propose a novel dynamic redundancy-aware ViT accelerator based on parallel computing, termed DRViT. DRViT is supported by an algorithm and architecture co-design. We first propose a hardware-friendly lightweight algorithm featuring token merging, token pruning, and an INT8 quantization scheme. Then, we design a specialized architecture to support this algorithm, transforming the lightweight algorithm into significant latency and energy-efficiency improvements. Our design is implemented on the Xilinx Alveo U250, achieving an overall inference latency of 0.86 ms and 1.17 ms per image for ViT-tiny at 140 MHz and 100 MHz, respectively. 
The throughput can reach 1,380 GOP/s at peak, demonstrating superior performance compared to state-of-the-art accelerators, even at lower frequencies.</div></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"199 ","pages":"Article 105042"},"PeriodicalIF":3.4,"publicationDate":"2025-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143098545","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
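Two of the lightweight techniques named in the DRViT abstract — token pruning and INT8 quantization — can be sketched in a few lines of NumPy. These are generic illustrations of the ideas (attention-based token pruning and symmetric per-tensor quantization), not DRViT's actual hardware algorithm; `prune_tokens`, `quantize_int8`, and their signatures are assumptions made for the sketch.

```python
import numpy as np

def prune_tokens(tokens, cls_attn, keep_ratio=0.5):
    """Token-pruning sketch: keep the CLS token plus the top-k patch
    tokens ranked by the attention the CLS token pays them.
    tokens: (N, D) array with row 0 the CLS token; cls_attn: (N-1,) scores."""
    k = max(1, int((tokens.shape[0] - 1) * keep_ratio))
    top = np.argsort(cls_attn)[::-1][:k]
    keep = np.sort(top) + 1   # +1 skips the CLS row; sort preserves order
    return np.concatenate([tokens[:1], tokens[keep]], axis=0)

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization sketch: scale by the
    maximum absolute value, round, and clamp to [-127, 127]."""
    scale = max(np.max(np.abs(x)) / 127.0, 1e-8)
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale   # dequantize as q.astype(np.float32) * scale
```

In a ViT pipeline, pruning of this kind would run between attention blocks to shrink the token sequence, while the quantization step maps weights and activations into INT8 for cheaper multiply-accumulate units on the FPGA.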