Improved Differential Evolution based Feature Selection through Quantum, Chaos, and Lasso
Yelleti Vivek, Sri Krishna Vadlamani, Vadlamani Ravi, P. Radha Krishna
Modern deep learning continues to achieve outstanding performance on an astounding variety of high-dimensional tasks. In practice, this is obtained by fitting deep neural models to all the input data with minimal feature engineering, thus sacrificing interpretability in many cases. However, in applications such as medicine, where interpretability is crucial, feature subset selection becomes an important problem. Metaheuristics such as Binary Differential Evolution are a popular approach to feature selection, and the research literature continues to introduce novel ideas, drawn from quantum computing and chaos theory, for instance, to improve them. In this paper, we demonstrate that replacing the random variables in quantum-inspired metaheuristics with chaos-generated variables, whose generation is guided by the Lyapunov time, significantly improves their performance on high-dimensional medical classification tasks and outperforms other approaches. We show that this chaos-induced improvement is a general phenomenon by demonstrating it for multiple varieties of underlying quantum-inspired metaheuristics. Performance is further enhanced through Lasso-assisted feature pruning. At the implementation level, we vastly speed up our algorithms through a scalable island-based computing cluster parallelization technique.
{"title":"Improved Differential Evolution based Feature Selection through Quantum, Chaos, and Lasso","authors":"Yelleti Vivek, Sri Krishna Vadlamani, Vadlamani Ravi, P. Radha Krishna","doi":"arxiv-2408.10693","DOIUrl":"https://doi.org/arxiv-2408.10693","url":null,"abstract":"Modern deep learning continues to achieve outstanding performance on an\u0000astounding variety of high-dimensional tasks. In practice, this is obtained by\u0000fitting deep neural models to all the input data with minimal feature\u0000engineering, thus sacrificing interpretability in many cases. However, in\u0000applications such as medicine, where interpretability is crucial, feature\u0000subset selection becomes an important problem. Metaheuristics such as Binary\u0000Differential Evolution are a popular approach to feature selection, and the\u0000research literature continues to introduce novel ideas, drawn from quantum\u0000computing and chaos theory, for instance, to improve them. In this paper, we\u0000demonstrate that introducing chaos-generated variables, generated from\u0000considerations of the Lyapunov time, in place of random variables in\u0000quantum-inspired metaheuristics significantly improves their performance on\u0000high-dimensional medical classification tasks and outperforms other approaches.\u0000We show that this chaos-induced improvement is a general phenomenon by\u0000demonstrating it for multiple varieties of underlying quantum-inspired\u0000metaheuristics. Performance is further enhanced through Lasso-assisted feature\u0000pruning. At the implementation level, we vastly speed up our algorithms through\u0000a scalable island-based computing cluster parallelization technique.","PeriodicalId":501347,"journal":{"name":"arXiv - CS - Neural and Evolutionary Computing","volume":"60 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142188278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recurrent Neural Networks Learn to Store and Generate Sequences using Non-Linear Representations
Róbert Csordás, Christopher Potts, Christopher D. Manning, Atticus Geiger
The Linear Representation Hypothesis (LRH) states that neural networks learn to encode concepts as directions in activation space, and a strong version of the LRH states that models learn only such encodings. In this paper, we present a counterexample to this strong LRH: when trained to repeat an input token sequence, gated recurrent neural networks (RNNs) learn to represent the token at each position with a particular order of magnitude, rather than a direction. These representations have layered features that are impossible to locate in distinct linear subspaces. To show this, we train interventions to predict and manipulate tokens by learning the scaling factor corresponding to each sequence position. These interventions indicate that the smallest RNNs find only this magnitude-based solution, while larger RNNs have linear representations. These findings strongly indicate that interpretability research should not be confined by the LRH.
{"title":"Recurrent Neural Networks Learn to Store and Generate Sequences using Non-Linear Representations","authors":"Róbert Csordás, Christopher Potts, Christopher D. Manning, Atticus Geiger","doi":"arxiv-2408.10920","DOIUrl":"https://doi.org/arxiv-2408.10920","url":null,"abstract":"The Linear Representation Hypothesis (LRH) states that neural networks learn\u0000to encode concepts as directions in activation space, and a strong version of\u0000the LRH states that models learn only such encodings. In this paper, we present\u0000a counterexample to this strong LRH: when trained to repeat an input token\u0000sequence, gated recurrent neural networks (RNNs) learn to represent the token\u0000at each position with a particular order of magnitude, rather than a direction.\u0000These representations have layered features that are impossible to locate in\u0000distinct linear subspaces. To show this, we train interventions to predict and\u0000manipulate tokens by learning the scaling factor corresponding to each sequence\u0000position. These interventions indicate that the smallest RNNs find only this\u0000magnitude-based solution, while larger RNNs have linear representations. These\u0000findings strongly indicate that interpretability research should not be\u0000confined by the LRH.","PeriodicalId":501347,"journal":{"name":"arXiv - CS - Neural and Evolutionary Computing","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142188297","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Event Stream based Sign Language Translation: A High-Definition Benchmark Dataset and A New Algorithm
Xiao Wang, Yao Rong, Fuling Wang, Jianing Li, Lin Zhu, Bo Jiang, Yaowei Wang
Sign Language Translation (SLT) is a core task in assistive AI for people with disabilities. Unlike traditional SLT based on visible-light videos, which is easily affected by factors such as lighting, rapid hand movements, and privacy breaches, this paper proposes the use of high-definition Event streams for SLT, effectively mitigating the aforementioned issues. This is primarily because Event streams have a high dynamic range and dense temporal signals, which withstand low illumination and motion blur well. Additionally, due to their spatial sparsity, they effectively protect the privacy of the target person. More specifically, we propose a new high-resolution Event stream sign language dataset, termed Event-CSL, which effectively fills the data gap in this area of research. It contains 14,827 videos, 14,821 glosses, and 2,544 Chinese words in the text vocabulary. These samples are collected in a variety of indoor and outdoor scenes, encompassing multiple angles, light intensities, and camera movements. We have benchmarked existing mainstream SLT works to enable fair comparison for future efforts. Based on this dataset and several other large-scale datasets, we propose a novel baseline method that fully leverages the Mamba model's ability to integrate the temporal information of CNN features, resulting in improved sign language translation outcomes. Both the benchmark dataset and source code will be released at https://github.com/Event-AHU/OpenESL
{"title":"Event Stream based Sign Language Translation: A High-Definition Benchmark Dataset and A New Algorithm","authors":"Xiao Wang, Yao Rong, Fuling Wang, Jianing Li, Lin Zhu, Bo Jiang, Yaowei Wang","doi":"arxiv-2408.10488","DOIUrl":"https://doi.org/arxiv-2408.10488","url":null,"abstract":"Sign Language Translation (SLT) is a core task in the field of AI-assisted\u0000disability. Unlike traditional SLT based on visible light videos, which is\u0000easily affected by factors such as lighting, rapid hand movements, and privacy\u0000breaches, this paper proposes the use of high-definition Event streams for SLT,\u0000effectively mitigating the aforementioned issues. This is primarily because\u0000Event streams have a high dynamic range and dense temporal signals, which can\u0000withstand low illumination and motion blur well. Additionally, due to their\u0000sparsity in space, they effectively protect the privacy of the target person.\u0000More specifically, we propose a new high-resolution Event stream sign language\u0000dataset, termed Event-CSL, which effectively fills the data gap in this area of\u0000research. It contains 14,827 videos, 14,821 glosses, and 2,544 Chinese words in\u0000the text vocabulary. These samples are collected in a variety of indoor and\u0000outdoor scenes, encompassing multiple angles, light intensities, and camera\u0000movements. We have benchmarked existing mainstream SLT works to enable fair\u0000comparison for future efforts. Based on this dataset and several other\u0000large-scale datasets, we propose a novel baseline method that fully leverages\u0000the Mamba model's ability to integrate temporal information of CNN features,\u0000resulting in improved sign language translation outcomes. Both the benchmark\u0000dataset and source code will be released on\u0000https://github.com/Event-AHU/OpenESL","PeriodicalId":501347,"journal":{"name":"arXiv - CS - Neural and Evolutionary Computing","volume":"52 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142188299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluation Framework for AI-driven Molecular Design of Multi-target Drugs: Brain Diseases as a Case Study
Arthur Cerveira, Frederico Kremer, Darling de Andrade Lourenço, Ulisses B Corrêa
The widespread application of Artificial Intelligence (AI) techniques has significantly influenced the development of new therapeutic agents. These computational methods can be used to design and predict the properties of generated molecules. Multi-target Drug Discovery (MTDD) is an emerging paradigm for discovering drugs against complex disorders that do not respond well to more traditional target-specific treatments, such as central nervous system, immune system, and cardiovascular diseases. However, there is not yet an established benchmark suite for assessing the effectiveness of AI tools for designing multi-target compounds. Standardized benchmarks allow for comparing existing techniques and promote rapid research progress. Hence, this work proposes an evaluation framework for molecule generation techniques in MTDD scenarios, considering brain diseases as a case study. Our methodology involves using large language models to select the appropriate molecular targets, gathering and preprocessing the bioassay datasets, training quantitative structure-activity relationship models to predict target modulation, and assessing other essential drug-likeness properties for implementing the benchmarks. Additionally, this work assesses the performance of four deep generative models and evolutionary algorithms on our benchmark suite. In our findings, both evolutionary algorithms and generative models can achieve competitive results across the proposed benchmarks.
{"title":"Evaluation Framework for AI-driven Molecular Design of Multi-target Drugs: Brain Diseases as a Case Study","authors":"Arthur Cerveira, Frederico Kremer, Darling de Andrade Lourenço, Ulisses B Corrêa","doi":"arxiv-2408.10482","DOIUrl":"https://doi.org/arxiv-2408.10482","url":null,"abstract":"The widespread application of Artificial Intelligence (AI) techniques has\u0000significantly influenced the development of new therapeutic agents. These\u0000computational methods can be used to design and predict the properties of\u0000generated molecules. Multi-target Drug Discovery (MTDD) is an emerging paradigm\u0000for discovering drugs against complex disorders that do not respond well to\u0000more traditional target-specific treatments, such as central nervous system,\u0000immune system, and cardiovascular diseases. Still, there is yet to be an\u0000established benchmark suite for assessing the effectiveness of AI tools for\u0000designing multi-target compounds. Standardized benchmarks allow for comparing\u0000existing techniques and promote rapid research progress. Hence, this work\u0000proposes an evaluation framework for molecule generation techniques in MTDD\u0000scenarios, considering brain diseases as a case study. Our methodology involves\u0000using large language models to select the appropriate molecular targets,\u0000gathering and preprocessing the bioassay datasets, training quantitative\u0000structure-activity relationship models to predict target modulation, and\u0000assessing other essential drug-likeness properties for implementing the\u0000benchmarks. Additionally, this work will assess the performance of four deep\u0000generative models and evolutionary algorithms over our benchmark suite. In our\u0000findings, both evolutionary algorithms and generative models can achieve\u0000competitive results across the proposed benchmarks.","PeriodicalId":501347,"journal":{"name":"arXiv - CS - Neural and Evolutionary Computing","volume":"23 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142188279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mutation Strength Adaptation of the $(\mu/\mu_I, \lambda)$-ES for Large Population Sizes on the Sphere Function
Amir Omeradzic, Hans-Georg Beyer
The mutation strength adaptation properties of a multi-recombinative $(\mu/\mu_I, \lambda)$-ES are studied for isotropic mutations. To this end, standard implementations of cumulative step-size adaptation (CSA) and mutative self-adaptation ($\sigma$SA) are investigated experimentally and theoretically by assuming large population sizes ($\mu$) in relation to the search space dimensionality ($N$). The adaptation is characterized in terms of the scale-invariant mutation strength on the sphere in relation to its maximum achievable value for positive progress. The results show how the different $\sigma$-adaptation variants behave as $\mu$ and $N$ are varied. Standard CSA variants show notably different adaptation properties and progress rates on the sphere, becoming slower or faster as $\mu$ or $N$ are varied. This is shown by investigating common choices for the cumulation and damping parameters. Standard $\sigma$SA variants (with default learning parameter settings) can achieve faster adaptation and larger progress rates compared to the CSA. However, it is shown how self-adaptation affects the progress rate levels negatively. Furthermore, differences regarding the adaptation and stability of $\sigma$SA with log-normal and normal mutation sampling are elaborated.
{"title":"Mutation Strength Adaptation of the $(μ/μ_I, λ)$-ES for Large Population Sizes on the Sphere Function","authors":"Amir Omeradzic, Hans-Georg Beyer","doi":"arxiv-2408.09761","DOIUrl":"https://doi.org/arxiv-2408.09761","url":null,"abstract":"The mutation strength adaptation properties of a multi-recombinative\u0000$(mu/mu_I, lambda)$-ES are studied for isotropic mutations. To this end,\u0000standard implementations of cumulative step-size adaptation (CSA) and mutative\u0000self-adaptation ($sigma$SA) are investigated experimentally and theoretically\u0000by assuming large population sizes ($mu$) in relation to the search space\u0000dimensionality ($N$). The adaptation is characterized in terms of the\u0000scale-invariant mutation strength on the sphere in relation to its maximum\u0000achievable value for positive progress. %The results show how the different\u0000$sigma$-adaptation variants behave as $mu$ and $N$ are varied. Standard\u0000CSA-variants show notably different adaptation properties and progress rates on\u0000the sphere, becoming slower or faster as $mu$ or $N$ are varied. This is shown\u0000by investigating common choices for the cumulation and damping parameters.\u0000Standard $sigma$SA-variants (with default learning parameter settings) can\u0000achieve faster adaptation and larger progress rates compared to the CSA.\u0000However, it is shown how self-adaptation affects the progress rate levels\u0000negatively. Furthermore, differences regarding the adaptation and stability of\u0000$sigma$SA with log-normal and normal mutation sampling are elaborated.","PeriodicalId":501347,"journal":{"name":"arXiv - CS - Neural and Evolutionary Computing","volume":"3 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142188300","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Liquid Fourier Latent Dynamics Networks for fast GPU-based numerical simulations in computational cardiology
Matteo Salvador, Alison L. Marsden
Scientific Machine Learning (ML) is gaining momentum as a cost-effective alternative to physics-based numerical solvers in many engineering applications. In fact, scientific ML is currently being used to build accurate and efficient surrogate models starting from high-fidelity numerical simulations, effectively encoding the parameterized temporal dynamics underlying Ordinary Differential Equations (ODEs), or even the spatio-temporal behavior underlying Partial Differential Equations (PDEs), in appropriately designed neural networks. We propose an extension of Latent Dynamics Networks (LDNets), namely Liquid Fourier LDNets (LFLDNets), to create parameterized space-time surrogate models for multiscale and multiphysics sets of highly nonlinear differential equations on complex geometries. LFLDNets employ a neurologically-inspired, sparse, liquid neural network for temporal dynamics, relaxing the requirement of a numerical solver for time advancement and leading to superior performance in terms of tunable parameters, accuracy, efficiency and learned trajectories with respect to neural ODEs based on feedforward fully-connected neural networks. Furthermore, in our implementation of LFLDNets, we use a Fourier embedding with a tunable kernel in the reconstruction network to learn high-frequency functions better and faster than using space coordinates directly as input. We challenge LFLDNets in the framework of computational cardiology and evaluate their capabilities on two 3-dimensional test cases arising from multiscale cardiac electrophysiology and cardiovascular hemodynamics. This paper illustrates the capability to run Artificial Intelligence-based numerical simulations on single or multiple GPUs in a matter of minutes and represents a significant step forward in the development of physics-informed digital twins.
{"title":"Liquid Fourier Latent Dynamics Networks for fast GPU-based numerical simulations in computational cardiology","authors":"Matteo Salvador, Alison L. Marsden","doi":"arxiv-2408.09818","DOIUrl":"https://doi.org/arxiv-2408.09818","url":null,"abstract":"Scientific Machine Learning (ML) is gaining momentum as a cost-effective\u0000alternative to physics-based numerical solvers in many engineering\u0000applications. In fact, scientific ML is currently being used to build accurate\u0000and efficient surrogate models starting from high-fidelity numerical\u0000simulations, effectively encoding the parameterized temporal dynamics\u0000underlying Ordinary Differential Equations (ODEs), or even the spatio-temporal\u0000behavior underlying Partial Differential Equations (PDEs), in appropriately\u0000designed neural networks. We propose an extension of Latent Dynamics Networks\u0000(LDNets), namely Liquid Fourier LDNets (LFLDNets), to create parameterized\u0000space-time surrogate models for multiscale and multiphysics sets of highly\u0000nonlinear differential equations on complex geometries. LFLDNets employ a\u0000neurologically-inspired, sparse, liquid neural network for temporal dynamics,\u0000relaxing the requirement of a numerical solver for time advancement and leading\u0000to superior performance in terms of tunable parameters, accuracy, efficiency\u0000and learned trajectories with respect to neural ODEs based on feedforward\u0000fully-connected neural networks. Furthermore, in our implementation of\u0000LFLDNets, we use a Fourier embedding with a tunable kernel in the\u0000reconstruction network to learn high-frequency functions better and faster than\u0000using space coordinates directly as input. We challenge LFLDNets in the\u0000framework of computational cardiology and evaluate their capabilities on two\u00003-dimensional test cases arising from multiscale cardiac electrophysiology and\u0000cardiovascular hemodynamics. This paper illustrates the capability to run\u0000Artificial Intelligence-based numerical simulations on single or multiple GPUs\u0000in a matter of minutes and represents a significant step forward in the\u0000development of physics-informed digital twins.","PeriodicalId":501347,"journal":{"name":"arXiv - CS - Neural and Evolutionary Computing","volume":"23 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142188304","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A More Accurate Approximation of Activation Function with Few Spikes Neurons
Dayena Jeong, Jaewoo Park, Jeonghee Jo, Jongkil Park, Jaewook Kim, Hyun Jae Jang, Suyoun Lee, Seongsik Park
Recent deep neural networks (DNNs), such as diffusion models [1], face high computational demands. Thus, spiking neural networks (SNNs) have attracted much attention as energy-efficient neural networks. However, conventional spiking neurons, such as leaky integrate-and-fire neurons, cannot accurately represent complex non-linear activation functions, such as Swish [2]. To approximate activation functions with spiking neurons, few-spikes (FS) neurons were proposed [3], but their approximation performance was limited by the lack of training methods that account for these neurons. We therefore propose tendency-based parameter initialization (TBPI), which exploits temporal dependencies when initializing the training parameters to enhance the approximation of activation functions with FS neurons.
{"title":"A More Accurate Approximation of Activation Function with Few Spikes Neurons","authors":"Dayena Jeong, Jaewoo Park, Jeonghee Jo, Jongkil Park, Jaewook Kim, Hyun Jae Jang, Suyoun Lee, Seongsik Park","doi":"arxiv-2409.00044","DOIUrl":"https://doi.org/arxiv-2409.00044","url":null,"abstract":"Recent deep neural networks (DNNs), such as diffusion models [1], have faced\u0000high computational demands. Thus, spiking neural networks (SNNs) have attracted\u0000lots of attention as energy-efficient neural networks. However, conventional\u0000spiking neurons, such as leaky integrate-and-fire neurons, cannot accurately\u0000represent complex non-linear activation functions, such as Swish [2]. To\u0000approximate activation functions with spiking neurons, few spikes (FS) neurons\u0000were proposed [3], but the approximation performance was limited due to the\u0000lack of training methods considering the neurons. Thus, we propose\u0000tendency-based parameter initialization (TBPI) to enhance the approximation of\u0000activation function with FS neurons, exploiting temporal dependencies\u0000initializing the training parameters.","PeriodicalId":501347,"journal":{"name":"arXiv - CS - Neural and Evolutionary Computing","volume":"37 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142188306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
TBA: Faster Large Language Model Training Using SSD-Based Activation Offloading
Kun Wu, Jeongmin Brian Park, Xiaofan Zhang, Mert Hidayetoğlu, Vikram Sharma Mailthody, Sitao Huang, Steven Sam Lumetta, Wen-mei Hwu
The growth rate of GPU memory capacity has not kept up with that of the size of large language models (LLMs), hindering the model training process. In particular, activations, the intermediate tensors produced during forward propagation and reused in backward propagation, dominate GPU memory use. To address this challenge, we propose TBA to efficiently offload activations to high-capacity NVMe SSDs. This approach reduces GPU memory usage without impacting performance by adaptively overlapping data transfers with computation. TBA is compatible with popular deep learning frameworks like PyTorch, Megatron, and DeepSpeed, and it employs techniques such as tensor deduplication, forwarding, and adaptive offloading to further enhance efficiency. We conduct extensive experiments on GPT, BERT, and T5. The results demonstrate that TBA reduces peak activation memory usage by 47%. At the same time, TBA perfectly overlaps the I/O with the computation and incurs negligible performance overhead. We introduce the recompute-offload-keep (ROK) curve to compare TBA offloading with two other tensor placement strategies: keeping activations in memory and layerwise full recomputation. We find that TBA achieves better memory savings than layerwise full recomputation while retaining the performance of keeping the activations in memory.
{"title":"TBA: Faster Large Language Model Training Using SSD-Based Activation Offloading","authors":"Kun Wu, Jeongmin Brian Park, Xiaofan Zhang, Mert Hidayetoğlu, Vikram Sharma Mailthody, Sitao Huang, Steven Sam Lumetta, Wen-mei Hwu","doi":"arxiv-2408.10013","DOIUrl":"https://doi.org/arxiv-2408.10013","url":null,"abstract":"The growth rate of the GPU memory capacity has not been able to keep up with\u0000that of the size of large language models (LLMs), hindering the model training\u0000process. In particular, activations -- the intermediate tensors produced during\u0000forward propagation and reused in backward propagation -- dominate the GPU\u0000memory use. To address this challenge, we propose TBA to efficiently offload\u0000activations to high-capacity NVMe SSDs. This approach reduces GPU memory usage\u0000without impacting performance by adaptively overlapping data transfers with\u0000computation. TBA is compatible with popular deep learning frameworks like\u0000PyTorch, Megatron, and DeepSpeed, and it employs techniques such as tensor\u0000deduplication, forwarding, and adaptive offloading to further enhance\u0000efficiency. We conduct extensive experiments on GPT, BERT, and T5. Results\u0000demonstrate that TBA effectively reduces 47% of the activation peak memory\u0000usage. At the same time, TBA perfectly overlaps the I/O with the computation\u0000and incurs negligible performance overhead. We introduce the\u0000recompute-offload-keep (ROK) curve to compare the TBA offloading with other two\u0000tensor placement strategies, keeping activations in memory and layerwise full\u0000recomputation. We find that TBA achieves better memory savings than layerwise\u0000full recomputation while retaining the performance of keeping the activations\u0000in memory.","PeriodicalId":501347,"journal":{"name":"arXiv - CS - Neural and Evolutionary Computing","volume":"7 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142188302","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enhancing Population-based Search with Active Inference
Nassim Dehouche, Daniel Friedman
The Active Inference framework models perception and action as a unified process, where agents use probabilistic models to predict and actively minimize sensory discrepancies. In complement and contrast, traditional population-based metaheuristics rely on reactive environmental interactions without anticipatory adaptation. This paper proposes integrating Active Inference into these metaheuristics to enhance performance through anticipatory environmental adaptation. We demonstrate this approach specifically with Ant Colony Optimization (ACO) on the Travelling Salesman Problem (TSP). Experimental results indicate that Active Inference can yield some improved solutions with only a marginal increase in computational cost, with interesting patterns of performance that relate to the number and topology of nodes in the graph. Further work will characterize where and when different types of Active Inference augmentation of population metaheuristics may be efficacious.
{"title":"Enhancing Population-based Search with Active Inference","authors":"Nassim Dehouche, Daniel Friedman","doi":"arxiv-2408.09548","DOIUrl":"https://doi.org/arxiv-2408.09548","url":null,"abstract":"The Active Inference framework models perception and action as a unified\u0000process, where agents use probabilistic models to predict and actively minimize\u0000sensory discrepancies. In complement and contrast, traditional population-based\u0000metaheuristics rely on reactive environmental interactions without anticipatory\u0000adaptation. This paper proposes the integration of Active Inference into these\u0000metaheuristics to enhance performance through anticipatory environmental\u0000adaptation. We demonstrate this approach specifically with Ant Colony\u0000Optimization (ACO) on the Travelling Salesman Problem (TSP). Experimental\u0000results indicate that Active Inference can yield some improved solutions with\u0000only a marginal increase in computational cost, with interesting patterns of\u0000performance that relate to number and topology of nodes in the graph. Further\u0000work will characterize where and when different types of Active Inference\u0000augmentation of population metaheuristics may be efficacious.","PeriodicalId":501347,"journal":{"name":"arXiv - CS - Neural and Evolutionary Computing","volume":"44 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142188301","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On the Improvement of Generalization and Stability of Forward-Only Learning via Neural Polarization
Erik B. Terres-Escudero, Javier Del Ser, Pablo Garcia-Bringas
Forward-only learning algorithms have recently gained attention as alternatives to gradient backpropagation, replacing the backward step of that solver with an additional contrastive forward pass. Among these approaches, the so-called Forward-Forward Algorithm (FFA) has been shown to achieve competitive levels of performance in terms of generalization and complexity. Networks trained using FFA learn to contrastively maximize a layer-wise defined goodness score when presented with real data (denoted as positive samples) and to minimize it when processing synthetic data (correspondingly, negative samples). However, this algorithm still faces weaknesses that negatively affect model accuracy and training stability, primarily due to a gradient imbalance between positive and negative samples. To overcome this issue, in this work we propose a novel implementation of the FFA algorithm, denoted as Polar-FFA, which extends the original formulation by introducing a neural division (polarization) between positive and negative instances. Neurons in each of these groups aim to maximize their goodness when presented with their respective data type, thereby creating a symmetric gradient behavior. To empirically gauge the improved learning capabilities of our proposed Polar-FFA, we perform several systematic experiments using different activation and goodness functions over image classification datasets. Our results demonstrate that Polar-FFA outperforms FFA in terms of accuracy and convergence speed. Furthermore, its lower reliance on hyperparameters reduces the need for hyperparameter tuning to guarantee optimal generalization capabilities, thereby allowing for a broader range of neural network configurations.
{"title":"On the Improvement of Generalization and Stability of Forward-Only Learning via Neural Polarization","authors":"Erik B. Terres-Escudero, Javier Del Ser, Pablo Garcia-Bringas","doi":"arxiv-2408.09210","DOIUrl":"https://doi.org/arxiv-2408.09210","url":null,"abstract":"Forward-only learning algorithms have recently gained attention as\u0000alternatives to gradient backpropagation, replacing the backward step of this\u0000latter solver with an additional contrastive forward pass. Among these\u0000approaches, the so-called Forward-Forward Algorithm (FFA) has been shown to\u0000achieve competitive levels of performance in terms of generalization and\u0000complexity. Networks trained using FFA learn to contrastively maximize a\u0000layer-wise defined goodness score when presented with real data (denoted as\u0000positive samples) and to minimize it when processing synthetic data (corr.\u0000negative samples). However, this algorithm still faces weaknesses that\u0000negatively affect the model accuracy and training stability, primarily due to a\u0000gradient imbalance between positive and negative samples. To overcome this\u0000issue, in this work we propose a novel implementation of the FFA algorithm,\u0000denoted as Polar-FFA, which extends the original formulation by introducing a\u0000neural division (emph{polarization}) between positive and negative instances.\u0000Neurons in each of these groups aim to maximize their goodness when presented\u0000with their respective data type, thereby creating a symmetric gradient\u0000behavior. To empirically gauge the improved learning capabilities of our\u0000proposed Polar-FFA, we perform several systematic experiments using different\u0000activation and goodness functions over image classification datasets. Our\u0000results demonstrate that Polar-FFA outperforms FFA in terms of accuracy and\u0000convergence speed. Furthermore, its lower reliance on hyperparameters reduces\u0000the need for hyperparameter tuning to guarantee optimal generalization\u0000capabilities, thereby allowing for a broader range of neural network\u0000configurations.","PeriodicalId":501347,"journal":{"name":"arXiv - CS - Neural and Evolutionary Computing","volume":"45 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142188457","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}