Optimized Tokenization Process for Open-Vocabulary Code Completion: An Empirical Study

Yasir Hussain, Zhiqiu Huang, Yu Zhou, I. A. Khan, Nasrullah Khan, Muhammad Zahid Abbas
{"title":"开放词汇码补全的优化标记化过程:实证研究","authors":"Yasir Hussain, Zhiqiu Huang, Yu Zhou, I. A. Khan, Nasrullah Khan, Muhammad Zahid Abbas","doi":"10.1145/3593434.3594236","DOIUrl":null,"url":null,"abstract":"Studies have substantiated the efficacy of deep learning-based models in various source code modeling tasks. These models are usually trained on large datasets that are divided into smaller units, known as tokens, utilizing either an open or closed vocabulary system. The selection of a tokenization method can have a profound impact on the number of tokens generated, which in turn can significantly influence the performance of the model. This study investigates the effect of different tokenization methods on source code modeling and proposes an optimized tokenizer to enhance the tokenization performance. The proposed tokenizer employs a hybrid approach that initializes with a global vocabulary based on the most frequent unigrams and incrementally builds an open-vocabulary system. The proposed tokenizer is evaluated against popular tokenization methods such as Closed, Unigram, WordPiece, and BPE tokenizers, as well as tokenizers provided by large pre-trained models such as PolyCoder and CodeGen. The results indicate that the choice of tokenization method can significantly impact the number of sub-tokens generated, which can ultimately influence the modeling performance of a model. Furthermore, our empirical evaluation demonstrates that the proposed tokenizer outperforms other baselines, achieving improved tokenization performance both in terms of a reduced number of sub-tokens and time cost. In conclusion, this study highlights the significance of the choice of tokenization method in source code modeling and the potential for improvement through optimized tokenization techniques.","PeriodicalId":178596,"journal":{"name":"Proceedings of the 27th International Conference on Evaluation and Assessment in Software Engineering","volume":"48 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Optimized Tokenization Process for Open-Vocabulary Code Completion: An Empirical Study\",\"authors\":\"Yasir Hussain, Zhiqiu Huang, Yu Zhou, I. A. Khan, Nasrullah Khan, Muhammad Zahid Abbas\",\"doi\":\"10.1145/3593434.3594236\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Studies have substantiated the efficacy of deep learning-based models in various source code modeling tasks. These models are usually trained on large datasets that are divided into smaller units, known as tokens, utilizing either an open or closed vocabulary system. The selection of a tokenization method can have a profound impact on the number of tokens generated, which in turn can significantly influence the performance of the model. This study investigates the effect of different tokenization methods on source code modeling and proposes an optimized tokenizer to enhance the tokenization performance. The proposed tokenizer employs a hybrid approach that initializes with a global vocabulary based on the most frequent unigrams and incrementally builds an open-vocabulary system. The proposed tokenizer is evaluated against popular tokenization methods such as Closed, Unigram, WordPiece, and BPE tokenizers, as well as tokenizers provided by large pre-trained models such as PolyCoder and CodeGen. 
The results indicate that the choice of tokenization method can significantly impact the number of sub-tokens generated, which can ultimately influence the modeling performance of a model. Furthermore, our empirical evaluation demonstrates that the proposed tokenizer outperforms other baselines, achieving improved tokenization performance both in terms of a reduced number of sub-tokens and time cost. In conclusion, this study highlights the significance of the choice of tokenization method in source code modeling and the potential for improvement through optimized tokenization techniques.\",\"PeriodicalId\":178596,\"journal\":{\"name\":\"Proceedings of the 27th International Conference on Evaluation and Assessment in Software Engineering\",\"volume\":\"48 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-06-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 27th International Conference on Evaluation and Assessment in Software Engineering\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3593434.3594236\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 27th International Conference on Evaluation and Assessment in Software Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3593434.3594236","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

Studies have substantiated the efficacy of deep learning-based models in various source code modeling tasks. These models are usually trained on large datasets that are divided into smaller units, known as tokens, using either an open or a closed vocabulary system. The choice of tokenization method can have a profound impact on the number of tokens generated, which in turn can significantly influence model performance. This study investigates the effect of different tokenization methods on source code modeling and proposes an optimized tokenizer to enhance tokenization performance. The proposed tokenizer employs a hybrid approach that initializes a global vocabulary with the most frequent unigrams and incrementally builds an open-vocabulary system. It is evaluated against popular tokenization methods such as Closed, Unigram, WordPiece, and BPE tokenizers, as well as the tokenizers shipped with large pre-trained models such as PolyCoder and CodeGen. The results indicate that the choice of tokenization method can significantly affect the number of sub-tokens generated, which ultimately influences modeling performance. Furthermore, our empirical evaluation demonstrates that the proposed tokenizer outperforms the other baselines, achieving improved tokenization performance in terms of both a reduced number of sub-tokens and lower time cost. In conclusion, this study highlights the significance of the choice of tokenization method in source code modeling and the potential for improvement through optimized tokenization techniques.
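The abstract does not include an implementation, but the hybrid idea it describes can be sketched: seed a global vocabulary with the most frequent unigrams, keep in-vocabulary tokens whole, and fall back to open-vocabulary sub-word segmentation for anything unseen. The sketch below is a minimal illustration under assumptions; the class name, vocabulary size, and greedy longest-match fallback are hypothetical choices, not the authors' published design.

```python
from collections import Counter

class HybridTokenizer:
    """Minimal sketch of a hybrid tokenizer: a global vocabulary seeded
    with the most frequent unigrams, plus an open-vocabulary fallback
    that splits out-of-vocabulary tokens into sub-tokens. All design
    details here are illustrative assumptions."""

    def __init__(self, vocab_size=10_000):
        self.vocab_size = vocab_size
        self.vocab = set()

    def fit(self, corpus):
        # Seed the global vocabulary with the most frequent unigrams.
        counts = Counter(tok for line in corpus for tok in line.split())
        self.vocab = {tok for tok, _ in counts.most_common(self.vocab_size)}

    def tokenize(self, line):
        out = []
        for tok in line.split():
            if tok in self.vocab:
                out.append(tok)                   # in-vocabulary: keep whole
            else:
                out.extend(self._split_oov(tok))  # OOV: open-vocabulary fallback
        return out

    def _split_oov(self, tok):
        # Greedy longest-match segmentation against the current vocabulary;
        # unmatched characters fall back to single-character sub-tokens.
        # Adding new pieces back into self.vocab here would let the
        # vocabulary grow over time -- one possible reading of the paper's
        # "incrementally builds an open-vocabulary system".
        pieces, i = [], 0
        while i < len(tok):
            for j in range(len(tok), i, -1):
                if tok[i:j] in self.vocab or j - i == 1:
                    pieces.append(tok[i:j])
                    i = j
                    break
        return pieces
```

A small usage example, with hypothetical training lines:

```python
tok = HybridTokenizer(vocab_size=5)
tok.fit(["def foo ( )", "def bar ( )"])
print(tok.tokenize("def foobar ( )"))  # ['def', 'foo', 'bar', '(', ')']
```

Seeding with frequent unigrams keeps common keywords and identifiers intact, which is one way such a hybrid scheme can emit fewer sub-tokens than a pure sub-word method.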
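The evaluation the abstract describes rests on two measurable criteria: the number of sub-tokens produced and the time cost of tokenization. A comparison against pre-trained-model tokenizers could be measured along the following lines; the Hugging Face checkpoint IDs are assumptions for illustration and are not necessarily the checkpoints the authors used.

```python
import time
from transformers import AutoTokenizer

# Hypothetical checkpoints standing in for the PolyCoder and CodeGen
# tokenizers mentioned in the paper; swap in whichever checkpoints apply.
CHECKPOINTS = {
    "CodeGen": "Salesforce/codegen-350M-mono",
    "PolyCoder": "NinedayWang/PolyCoder-160M",
}

snippet = "def binary_search(arr, target):\n    lo, hi = 0, len(arr) - 1\n"

for name, ckpt in CHECKPOINTS.items():
    tok = AutoTokenizer.from_pretrained(ckpt)
    start = time.perf_counter()
    sub_tokens = tok.tokenize(snippet)
    elapsed = time.perf_counter() - start
    # Fewer sub-tokens and lower tokenization time are the two
    # criteria the study uses to compare tokenizers.
    print(f"{name}: {len(sub_tokens)} sub-tokens, {elapsed * 1e3:.2f} ms")
```

Aggregating these counts and timings over a whole corpus, rather than a single snippet, would mirror the study's comparison across tokenization methods.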