Transformer models have become widely popular in numerous applications, and especially for building foundation large language models (LLMs). Recently, there has been a surge in the exploration of transformer-based architectures in non-LLM applications. In particular, the self-attention mechanism within the transformer architecture offers a way to exploit any hidden relations within data, making it widely applicable for a variety of spatio-temporal tasks in scientific computing domains (e.g., weather, traffic, agriculture). Most of these efforts have primarily focused on accelerating the inference phase. However, the computational resources required to train these attention-based models for scientific applications remain a significant challenge to address. Emerging non-volatile memory (NVM)-based processing-in-memory (PIM) architectures can achieve higher performance and better energy efficiency than their GPU-based counterparts. However, the frequent weight updates during training would necessitate write operations to NVM cells, posing a significant barrier for considering stand-alone NVM-based PIM architectures. In this paper, we present HpT, a new hybrid approach to accelerate the training of attention-based models for scientific applications. Our approach is hybrid at two different layers: at the software layer, our approach dynamically switches from a full-parameter training mode to a lower-parameter training mode by incorporating intrinsic dimensionality; and at the hardware layer, our approach harnesses the combined power of GPUs, resistive random-access memory (ReRAM)-based PIM devices, and systolic arrays. This software-hardware co-design approach is aimed at adaptively reducing both runtime and energy costs during the training phase, without compromising on quality. Experiments on four concrete real-world scientific applications demonstrate that our hybrid approach is able to significantly reduce training time (up to $11.9times$) and energy consumption (up to $12.05times$), compared to the corresponding full-parameter training executing on only GPUs. Our approach serves as an example for accelerating the training of attention-based models on heterogeneous platforms including ReRAMs.
{"title":"HpT: Hybrid Acceleration of Spatio-Temporal Attention Model Training on Heterogeneous Manycore Architectures","authors":"Saiman Dahal;Pratyush Dhingra;Krishu Kumar Thapa;Partha Pratim Pande;Ananth Kalyanaraman","doi":"10.1109/TPDS.2024.3522781","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3522781","url":null,"abstract":"Transformer models have become widely popular in numerous applications, and especially for building foundation large language models (LLMs). Recently, there has been a surge in the exploration of transformer-based architectures in non-LLM applications. In particular, the self-attention mechanism within the transformer architecture offers a way to exploit any hidden relations within data, making it widely applicable for a variety of spatio-temporal tasks in scientific computing domains (e.g., weather, traffic, agriculture). Most of these efforts have primarily focused on accelerating the inference phase. However, the computational resources required to train these attention-based models for scientific applications remain a significant challenge to address. Emerging non-volatile memory (NVM)-based processing-in-memory (PIM) architectures can achieve higher performance and better energy efficiency than their GPU-based counterparts. However, the frequent weight updates during training would necessitate write operations to NVM cells, posing a significant barrier for considering stand-alone NVM-based PIM architectures. In this paper, we present <monospace>HpT</monospace>, a new hybrid approach to accelerate the training of attention-based models for scientific applications. Our approach is hybrid at two different layers: at the software layer, our approach dynamically switches from a full-parameter training mode to a lower-parameter training mode by incorporating intrinsic dimensionality; and at the hardware layer, our approach harnesses the combined power of GPUs, resistive random-access memory (ReRAM)-based PIM devices, and systolic arrays. This software-hardware co-design approach is aimed at adaptively reducing both runtime and energy costs during the training phase, without compromising on quality. Experiments on four concrete real-world scientific applications demonstrate that our hybrid approach is able to significantly reduce training time (up to <inline-formula><tex-math>$11.9times$</tex-math></inline-formula>) and energy consumption (up to <inline-formula><tex-math>$12.05times$</tex-math></inline-formula>), compared to the corresponding full-parameter training executing on only GPUs. Our approach serves as an example for accelerating the training of attention-based models on heterogeneous platforms including ReRAMs.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 3","pages":"407-421"},"PeriodicalIF":5.6,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142992862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sharding is a promising solution to scale blockchain by separating the system into multiple shards to process transactions in parallel. However, due to state separation and shard isolation, it is still challenging to efficiently support smart contracts on a blockchain sharding system where smart contracts can interact with each other, involving states maintained by multiple shards. Specifically, existing sharding systems adopt a costly multi-step collaboration mechanism to execute smart contracts, resulting in long latency and low throughput. This article proposes Sparrow, a blockchain sharding protocol achieving one-step execution for smart contracts. To break shard isolation, inspired by non-local hotspot data caching in traditional databases, we propose a new idea of inter-shard caching, allowing a shard to prefetch and cache frequently accessed contract states of other shards. The miner can thus use the inter-shard cache to pre-execute a pending transaction, retrieve all its contract invocations, and commit it to multiple shards in one step. Particularly, we first propose a speculative dispersal cache synchronisation mechanism for efficient and secure cache synchronization across shards in Byzantine environments. Then, we propose a multi-branch exploration mechanism to solve the rollback problem during the optimistic one-step execution of contract invocations with dependencies. We also present a series of conflict resolution mechanisms to decrease the rollback caused by inherent transaction conflicts. We implement prototypes for Sparrow and existing sharding systems, and the evaluation shows that Sparrow improves the throughput by $2.44times$ and reduces the transaction latency by 30% compared with the existing sharding systems.
{"title":"Sparrow: Expediting Smart Contract Execution for Blockchain Sharding via Inter-Shard Caching","authors":"Junyuan Liang;Peiyuan Yao;Wuhui Chen;Zicong Hong;Jianting Zhang;Ting Cai;Min Sun;Zibin Zheng","doi":"10.1109/TPDS.2024.3522016","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3522016","url":null,"abstract":"Sharding is a promising solution to scale blockchain by separating the system into multiple shards to process transactions in parallel. However, due to state separation and shard isolation, it is still challenging to efficiently support smart contracts on a blockchain sharding system where smart contracts can interact with each other, involving states maintained by multiple shards. Specifically, existing sharding systems adopt a costly multi-step collaboration mechanism to execute smart contracts, resulting in long latency and low throughput. This article proposes <small>Sparrow</small>, a blockchain sharding protocol achieving one-step execution for smart contracts. To break shard isolation, inspired by non-local hotspot data caching in traditional databases, we propose a new idea of <i>inter-shard caching</i>, allowing a shard to prefetch and cache frequently accessed contract states of other shards. The miner can thus use the inter-shard cache to pre-execute a pending transaction, retrieve all its contract invocations, and commit it to multiple shards in one step. Particularly, we first propose a speculative dispersal cache synchronisation mechanism for efficient and secure cache synchronization across shards in Byzantine environments. Then, we propose a multi-branch exploration mechanism to solve the rollback problem during the optimistic one-step execution of contract invocations with dependencies. We also present a series of conflict resolution mechanisms to decrease the rollback caused by inherent transaction conflicts. We implement prototypes for <small>Sparrow</small> and existing sharding systems, and the evaluation shows that <small>Sparrow</small> improves the throughput by <inline-formula><tex-math>$2.44times$</tex-math></inline-formula> and reduces the transaction latency by 30% compared with the existing sharding systems.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 3","pages":"377-390"},"PeriodicalIF":5.6,"publicationDate":"2024-12-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142992863","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Since 2017, NVIDIA GPUs have been equipped with specialized units known as Tensor Cores, which demonstrate remarkable efficiency in processing matrix multiplications (GEMMs). Beyond GEMMs, researchers have explored the potential applications of Tensor Cores in matrix factorization, such as QR factorization. However, the inside GEMMs in QR factorization are typically tall and skinny. Compared to compute-bound square GEMMs, these tall and skinny GEMMs are memory bound, leading to suboptimal performance on Tensor Cores. To solve this problem, we indicate the recursive QR factorization can convert the tall and skinny GEMMs to relatively square and large GEMMs, resulting in better performance on Tensor Cores. Besides, we extend the FP16 Tensor-Cores-based QR factorization to accommodate FP32 and FP64 on FP16 and INT8 Tensor Cores, respectively. Additionally, to address the issue of orthogonality loss in the preceding Tensor Cores-based QR factorization, we transition from the Gram-Schmidt to the Householder algorithm while preserving high performance. According to our experimental evaluation conducted on NVIDIA's A100 and GeForce RTX 3090 GPU, the precision levels of FP64, FP32, and FP16 are up to 6.22x, 8.67x, and 4.03x faster, respectively, than the current state-of-the-art implementations.
{"title":"High Performance Householder QR Factorization on Emerging GPU Architectures Using Tensor Cores","authors":"Yuhan Leng;Gaoyuan Zou;Hansheng Wang;Panruo Wu;Shaoshuai Zhang","doi":"10.1109/TPDS.2024.3522776","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3522776","url":null,"abstract":"Since 2017, NVIDIA GPUs have been equipped with specialized units known as Tensor Cores, which demonstrate remarkable efficiency in processing matrix multiplications (GEMMs). Beyond GEMMs, researchers have explored the potential applications of Tensor Cores in matrix factorization, such as QR factorization. However, the inside GEMMs in QR factorization are typically tall and skinny. Compared to compute-bound square GEMMs, these tall and skinny GEMMs are memory bound, leading to suboptimal performance on Tensor Cores. To solve this problem, we indicate the recursive QR factorization can convert the tall and skinny GEMMs to relatively square and large GEMMs, resulting in better performance on Tensor Cores. Besides, we extend the FP16 Tensor-Cores-based QR factorization to accommodate FP32 and FP64 on FP16 and INT8 Tensor Cores, respectively. Additionally, to address the issue of orthogonality loss in the preceding Tensor Cores-based QR factorization, we transition from the Gram-Schmidt to the Householder algorithm while preserving high performance. According to our experimental evaluation conducted on NVIDIA's A100 and GeForce RTX 3090 GPU, the precision levels of FP64, FP32, and FP16 are up to 6.22x, 8.67x, and 4.03x faster, respectively, than the current state-of-the-art implementations.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 3","pages":"422-436"},"PeriodicalIF":5.6,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143105824","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GPU clusters have been widely used to co-locate various deep learning (DL) workloads in a multi-tenant way. Although such resource sharing can significantly reduce training cost, resource contention and interference among co-located workloads make task scheduling very complex and challenging. To simplify the scheduling problem, existing algorithms usually divide the procedure of scheduling into two sub-tasks, i.e., task placement and resource allocation, and allocate resources according to pre-defined and fixed resource demands. However, such a paradigm significantly constrains the selection of potential scheduling solutions. In this article, we present MAIFS, a novel multi-agent reinforcement learning based scheduling algorithm that handles task placement and resource allocation integratedly, and allows fungible resource allocation based on resource sensitivity of DL workloads. The core of MAIFS lies in two mechanisms. The multi-agent attention mechanism is designed to learn and share inter-related resource state features observed from different agents, which enables agents to explore fungible resource allocation solutions. The dynamic coordination graph mechanism is designed for coordinating interactive task placement decisions of agents during integrated scheduling, so as to mitigate potential task conflicts. Simulated experiments using two large scale production DL workload traces and physical deployment experiments based on a Kubernetes based GPU cluster show that MAIFS can outperform state-of-the-art scheduling algorithms by up to 44% in terms of makespan and 46% in terms of job completion time (JCT).
{"title":"Integrated and Fungible Scheduling of Deep Learning Workloads Using Multi-Agent Reinforcement Learning","authors":"Jialun Li;Danyang Xiao;Diying Yang;Xuan Mo;Weigang Wu","doi":"10.1109/TPDS.2024.3522333","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3522333","url":null,"abstract":"GPU clusters have been widely used to co-locate various deep learning (DL) workloads in a multi-tenant way. Although such resource sharing can significantly reduce training cost, resource contention and interference among co-located workloads make task scheduling very complex and challenging. To simplify the scheduling problem, existing algorithms usually divide the procedure of scheduling into two sub-tasks, i.e., task placement and resource allocation, and allocate resources according to pre-defined and fixed resource demands. However, such a paradigm significantly constrains the selection of potential scheduling solutions. In this article, we present MAIFS, a novel multi-agent reinforcement learning based scheduling algorithm that handles task placement and resource allocation integratedly, and allows fungible resource allocation based on resource sensitivity of DL workloads. The core of MAIFS lies in two mechanisms. The multi-agent attention mechanism is designed to learn and share inter-related resource state features observed from different agents, which enables agents to explore fungible resource allocation solutions. The dynamic coordination graph mechanism is designed for coordinating interactive task placement decisions of agents during integrated scheduling, so as to mitigate potential task conflicts. Simulated experiments using two large scale production DL workload traces and physical deployment experiments based on a Kubernetes based GPU cluster show that MAIFS can outperform state-of-the-art scheduling algorithms by up to 44% in terms of makespan and 46% in terms of job completion time (JCT).","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 3","pages":"391-406"},"PeriodicalIF":5.6,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143105817","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-12-24DOI: 10.1109/TPDS.2024.3521897
Hongkuan Zhou;Bingyi Zhang;Rajgopal Kannan;Carl Busart;Viktor K. Prasanna
Temporal Graph Neural Networks (TGNNs) are powerful models to capture temporal, structural, and contextual information on temporal graphs, outperforming other methods in many high-impact downstream tasks. However, achieving high-performance TGNN inference in production environments is challenging because TGNN models suffer from high computation complexity and intrinsic temporal data dependency that hinders data parallelism. In addition, real-world TGNN applications have different latency and throughput requirements. This work presents ViTeGNN, a versatile TGNN inference solution for memory-based TGNNs on FPGAs. ViTeGNN performs algorithm-model-architecture co-design to meet the latency and throughput requirements of real-world TGNN applications. Besides the vanilla inference mode ViTeGNN-bal that updates embeddings for nodes interacting with others, we propose ViTeGNN-lat and ViTeGNN-thpt, optimized for latency and throughput. Our model optimizations include a lightweight method to compute attention scores and a related temporal neighbor pruning strategy to reduce computation and memory accesses. These are holistically coupled with key hardware optimizations that leverage the FPGA hardware. We propose a novel hardware module to execute the complex neighbor update process efficiently. To ensure similar accuracy vis-á-vis the original model, the simplified models are trained using the knowledge distillation technique. We propose a unified hardware design that supports all of these three inference modes without FPGA reconfiguration. Enabled by our flexible hardware architecture, we further propose ViTeGNN-auto, which automatically selects the best inference mode at runtime based on latency and throughput requirements, guided by our accurate performance model. We evaluate the performance of the proposed hardware accelerator on five real-world datasets. ViTeGNN-bal reduces the computation complexity by an average of 62% and memory accesses by an average of 36% with only 0.0042 accuracy loss. Compared with state-of-the-art implementations on CPU and GPU, our FPGA implementation achieves $53.9/26.0/16.1times$ speedup and $8.2/4.0/2.5times$ speedup for ViTeGNN-lat/-bal/-thpt, respectively.
{"title":"ViTeGNN: Towards Versatile Inference of Temporal Graph Neural Networks on FPGA","authors":"Hongkuan Zhou;Bingyi Zhang;Rajgopal Kannan;Carl Busart;Viktor K. Prasanna","doi":"10.1109/TPDS.2024.3521897","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3521897","url":null,"abstract":"Temporal Graph Neural Networks (TGNNs) are powerful models to capture temporal, structural, and contextual information on temporal graphs, outperforming other methods in many high-impact downstream tasks. However, achieving high-performance TGNN inference in production environments is challenging because TGNN models suffer from high computation complexity and intrinsic temporal data dependency that hinders data parallelism. In addition, real-world TGNN applications have different latency and throughput requirements. This work presents ViTeGNN, a versatile TGNN inference solution for memory-based TGNNs on FPGAs. ViTeGNN performs algorithm-model-architecture co-design to meet the latency and throughput requirements of real-world TGNN applications. Besides the vanilla inference mode ViTeGNN-bal that updates embeddings for nodes interacting with others, we propose ViTeGNN-lat and ViTeGNN-thpt, optimized for latency and throughput. Our model optimizations include a lightweight method to compute attention scores and a related temporal neighbor pruning strategy to reduce computation and memory accesses. These are holistically coupled with key hardware optimizations that leverage the FPGA hardware. We propose a novel hardware module to execute the complex neighbor update process efficiently. To ensure similar accuracy vis-á-vis the original model, the simplified models are trained using the knowledge distillation technique. We propose a unified hardware design that supports all of these three inference modes without FPGA reconfiguration. Enabled by our flexible hardware architecture, we further propose ViTeGNN-auto, which automatically selects the best inference mode at runtime based on latency and throughput requirements, guided by our accurate performance model. We evaluate the performance of the proposed hardware accelerator on five real-world datasets. ViTeGNN-bal reduces the computation complexity by an average of 62% and memory accesses by an average of 36% with only 0.0042 accuracy loss. Compared with state-of-the-art implementations on CPU and GPU, our FPGA implementation achieves <inline-formula><tex-math>$53.9/26.0/16.1times$</tex-math></inline-formula> speedup and <inline-formula><tex-math>$8.2/4.0/2.5times$</tex-math></inline-formula> speedup for ViTeGNN-lat/-bal/-thpt, respectively.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 3","pages":"502-519"},"PeriodicalIF":5.6,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143105821","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In cloud data centers, the exponential growth of data places increasing demands on computing, storage, and network resources, especially in multi-tenant environments. While this growth is crucial for ensuring Quality of Service (QoS), it also introduces challenges such as fluctuating resource requirements and static container configurations, which can lead to resource underutilization and high energy consumption. This article addresses online resource provisioning and efficient scheduling for multi-tenant environments, aiming to minimize energy consumption while balancing elasticity and QoS requirements. To address this, we propose a novel optimization framework that reformulates the resource provisioning problem into a more manageable form. By reducing the original multi-constraint optimization to a container placement problem, we apply the interior-point barrier method to simplify the optimization, integrating constraints directly into the objective function for efficient computation. We also introduce elasticity as a key parameter to balance energy consumption with autonomous resource scaling, ensuring that resource consolidation does not compromise system flexibility. The proposed Energy-Efficient and Elastic Resource Provisioning (EEP) framework comprises three main modules: a distributed resource management module that employs vertical partitioning and dynamic leader election for adaptive resource allocation; a prediction module using $omega$-step prediction for accurate resource demand forecasting; and an elastic scheduling module that dynamically adjusts to tenant scaling needs, optimizing resource allocation and minimizing energy consumption. Extensive experiments across diverse cloud scenarios demonstrate that the EEP framework significantly improves energy efficiency and resource utilization compared to established baselines, supporting sustainable cloud management practices.
{"title":"Online Elastic Resource Provisioning With QoS Guarantee in Container-Based Cloud Computing","authors":"Shuaibing Lu;Ran Yan;Jie Wu;Jackson Yang;Xinyu Deng;Shen Wu;Zhi Cai;Juan Fang","doi":"10.1109/TPDS.2024.3522085","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3522085","url":null,"abstract":"In cloud data centers, the exponential growth of data places increasing demands on computing, storage, and network resources, especially in multi-tenant environments. While this growth is crucial for ensuring Quality of Service (QoS), it also introduces challenges such as fluctuating resource requirements and static container configurations, which can lead to resource underutilization and high energy consumption. This article addresses online resource provisioning and efficient scheduling for multi-tenant environments, aiming to minimize energy consumption while balancing elasticity and QoS requirements. To address this, we propose a novel optimization framework that reformulates the resource provisioning problem into a more manageable form. By reducing the original multi-constraint optimization to a container placement problem, we apply the interior-point barrier method to simplify the optimization, integrating constraints directly into the objective function for efficient computation. We also introduce elasticity as a key parameter to balance energy consumption with autonomous resource scaling, ensuring that resource consolidation does not compromise system flexibility. The proposed Energy-Efficient and Elastic Resource Provisioning (EEP) framework comprises three main modules: a distributed resource management module that employs vertical partitioning and dynamic leader election for adaptive resource allocation; a prediction module using <inline-formula><tex-math>$omega$</tex-math></inline-formula>-step prediction for accurate resource demand forecasting; and an elastic scheduling module that dynamically adjusts to tenant scaling needs, optimizing resource allocation and minimizing energy consumption. Extensive experiments across diverse cloud scenarios demonstrate that the EEP framework significantly improves energy efficiency and resource utilization compared to established baselines, supporting sustainable cloud management practices.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 3","pages":"361-376"},"PeriodicalIF":5.6,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143105920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-12-20DOI: 10.1109/TPDS.2024.3520395
Cristóbal A. Navarro;Felipe A. Quezada;Enzo Meneses;Héctor Ferrada;Nancy Hitschfeld
Cellular automata (CA) are simulation models that can produce complex emergent behaviors from simple local rules. Although state-of-the-art GPU solutions are already fast due to their data-parallel nature, their performance can rapidly degrade in CA with a large neighborhood radius. With the inclusion of tensor cores across the entire GPU ecosystem, interest has grown in finding ways to leverage these fast units outside the field of artificial intelligence, which was their original purpose. In this work, we present CAT, a GPU tensor core approach that can accelerate CA in which the cell transition function acts on a weighted summation of its neighborhood. CAT is evaluated theoretically, using an extended PRAM cost model, as well as empirically using the Larger Than Life (LTL) family of CA as case studies. The results confirm that the cost model is accurate, showing that CAT exhibits constant time throughout the entire radius range $1 leq r leq 16$