Revalidating the Encoder-Decoder Depths and Activation Function to Find Optimum Vanilla Transformer Model

Y. Heryadi, B. Wijanarko, Dina Fitria Murad, C. Tho, Kiyota Hashimoto

2023 International Conference on Computer Science, Information Technology and Engineering (ICCoSITE), 2023-02-16. DOI: 10.1109/ICCoSITE57641.2023.10127790
Abstract
The transformer model has become a state-of-the-art model in Natural Language Processing. The original transformer, known as the vanilla transformer model, was designed to improve on prominent models for sequence modeling and transduction problems such as language modeling and machine translation. It stacks 6 identical layers in both its encoder and its decoder and relies on an attention mechanism, with the aim of overcoming the limitations of common recurrent language models and encoder-decoder architectures. Its outstanding performance has inspired many researchers to extend the architecture to improve its performance and computational efficiency. Despite the many extensions to the vanilla transformer, there is no clear justification for the encoder-decoder depth chosen in the vanilla transformer model. This paper presents exploratory results on how the combination of encoder-decoder layer depth and the activation function used in the feed-forward layer of the vanilla transformer model affects its performance. The model is tested on a downstream task: text translation from Bahasa Indonesia to the Sundanese language. Although the differences in value are not large, the empirical results show that depth = 2 with the Sigmoid, Tanh, or ReLU activation function, and depth = 6 with ReLU activation, achieve the highest average training accuracy. Interestingly, depth = 6 with ReLU also shows the lowest average training and validation loss. However, statistically, there is no significant difference across depths and activation functions.
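The experimental factors described above (encoder-decoder depth and feed-forward activation) can be illustrated with a minimal sketch. This is not the authors' code; it assumes PyTorch's nn.Transformer, whose activation argument accepts a callable in recent releases (>= 1.9), and uses only the depths and activations named in the abstract (the paper's full grid and hyperparameters are not given here).

```python
# Minimal sketch of a depth x activation sweep over a vanilla transformer.
# Hypothetical helper names; hyperparameters (d_model, nhead, dim_feedforward)
# are the defaults from "Attention Is All You Need", not necessarily the paper's.
import itertools
import torch
import torch.nn as nn


def build_vanilla_transformer(depth: int, activation) -> nn.Transformer:
    """Vanilla transformer with `depth` identical encoder and decoder layers."""
    return nn.Transformer(
        d_model=512,
        nhead=8,
        num_encoder_layers=depth,   # encoder stack depth under study
        num_decoder_layers=depth,   # decoder stack depth under study
        dim_feedforward=2048,
        activation=activation,      # activation of the position-wise feed-forward layer
        batch_first=True,
    )


# Factors mentioned in the abstract: depth in {2, 6}, activation in {Sigmoid, Tanh, ReLU}.
depths = [2, 6]
activations = {"sigmoid": torch.sigmoid, "tanh": torch.tanh, "relu": torch.relu}

for depth, (name, fn) in itertools.product(depths, activations.items()):
    model = build_vanilla_transformer(depth, fn)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"depth={depth}, activation={name}: {n_params / 1e6:.1f}M parameters")
```

Each configuration would then be trained and evaluated on the Bahasa Indonesia to Sundanese translation task, and the resulting training/validation accuracy and loss compared across the grid.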