Short-Circuit Dispatch: Accelerating Virtual Machine Interpreters on Embedded Processors

Channoh Kim, Sungmin Kim, Hyeonwoong Cho, Doo-Young Kim, Jaehyeok Kim, Young H. Oh, Hakbeom Jang, Jae W. Lee
{"title":"Short-Circuit Dispatch: Accelerating Virtual Machine Interpreters on Embedded Processors","authors":"Channoh Kim, Sungmin Kim, Hyeonwoong Cho, Doo-Young Kim, Jaehyeok Kim, Young H. Oh, Hakbeom Jang, Jae W. Lee","doi":"10.1145/3007787.3001168","DOIUrl":null,"url":null,"abstract":"Interpreters are widely used to implement high-level language virtual machines (VMs), especially on resource-constrained embedded platforms. Many scripting languages employ interpreter-based VMs for their advantages over native code compilers, such as portability, smaller resource footprint, and compact codes. For efficient interpretation a script (program) is first compiled into an intermediate representation, or bytecodes. The canonical interpreter then runs an infinite loop that fetches, decodes, and executes one bytecode at a time. This bytecode dispatch loop is a well-known source of inefficiency, typically featuring a large jump table with a hard-to-predict indirect jump. Most existing techniques to optimize this loop focus on reducing the misprediction rate of this indirect jump in both hardware and software. However, these techniques are much less effective on embedded processors with shallow pipelines and low IPCs. Instead, we tackle another source of inefficiency more prominent on embedded platforms - redundant computation in the dispatch loop. To this end, we propose Short-Circuit Dispatch (SCD), a low cost architectural extension that enables fast, hardware-based bytecode dispatch with fewer instructions. The key idea of SCD is to overlay the software-created bytecode jump table on a branch target buffer (BTB). Once a bytecode is fetched, the BTB is looked up using the bytecode, instead of PC, as key. If it hits, the interpreter directly jumps to the target address retrieved from the BTB, otherwise, it goes through the original dispatch path. This effectively eliminates redundant computation in the dispatcher code for decode, bound check, and target address calculation, thus significantly reducing total instruction count. Our simulation results demonstrate that SCD achieves geomean speedups of 19.9% and 14.1% for two production-grade script interpreters for Lua and JavaScript, respectively. Moreover, our fully synthesizable RTL design based on a RISC-V embedded processor shows that SCD improves the EDP of the Lua interpreter by 24.2%, while increasing the chip area by only 0.72% at a 40nm technology node.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"28 1","pages":"291-303"},"PeriodicalIF":0.0000,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3007787.3001168","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 5

Abstract

Interpreters are widely used to implement high-level language virtual machines (VMs), especially on resource-constrained embedded platforms. Many scripting languages employ interpreter-based VMs for their advantages over native code compilers, such as portability, a smaller resource footprint, and compact code. For efficient interpretation, a script (program) is first compiled into an intermediate representation, or bytecode. The canonical interpreter then runs an infinite loop that fetches, decodes, and executes one bytecode at a time. This bytecode dispatch loop is a well-known source of inefficiency, typically featuring a large jump table reached through a hard-to-predict indirect jump. Most existing techniques to optimize this loop, in both hardware and software, focus on reducing the misprediction rate of this indirect jump. However, these techniques are much less effective on embedded processors with shallow pipelines and low IPCs. Instead, we tackle another source of inefficiency that is more prominent on embedded platforms: redundant computation in the dispatch loop. To this end, we propose Short-Circuit Dispatch (SCD), a low-cost architectural extension that enables fast, hardware-based bytecode dispatch with fewer instructions. The key idea of SCD is to overlay the software-created bytecode jump table on a branch target buffer (BTB). Once a bytecode is fetched, the BTB is looked up using the bytecode, instead of the PC, as the key. If it hits, the interpreter jumps directly to the target address retrieved from the BTB; otherwise, it falls back to the original dispatch path. This effectively eliminates redundant computation in the dispatcher code for decode, bounds check, and target-address calculation, significantly reducing the total instruction count. Our simulation results demonstrate that SCD achieves geomean speedups of 19.9% and 14.1% for two production-grade script interpreters, for Lua and JavaScript, respectively. Moreover, our fully synthesizable RTL design based on a RISC-V embedded processor shows that SCD improves the EDP of the Lua interpreter by 24.2%, while increasing chip area by only 0.72% at a 40nm technology node.
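To make the dispatch overhead concrete, the sketch below shows a minimal threaded-code dispatch loop of the kind the abstract describes, written in C with GCC/Clang's computed-goto extension. The opcodes, handlers, and jump table are hypothetical and are not taken from the Lua or JavaScript interpreters evaluated in the paper; the comments mark the bounds check, decode, and indirect jump that SCD would short-circuit with its bytecode-keyed BTB lookup.

    /* Minimal, hypothetical threaded interpreter (not from the paper).
     * Requires GCC or Clang for the computed-goto (&&label) extension. */
    #include <stdint.h>
    #include <stdio.h>

    enum { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT, NUM_OPS };

    static void interpret(const uint8_t *code)
    {
        int64_t stack[64];
        int sp = 0;
        const uint8_t *pc = code;

        /* Software-created bytecode jump table: the structure SCD
         * overlays onto the BTB in hardware. */
        static void *jump_table[NUM_OPS] = {
            &&do_push, &&do_add, &&do_print, &&do_halt
        };

        for (;;) {
            uint8_t op = *pc++;            /* fetch bytecode                  */
            if (op >= NUM_OPS) return;     /* bounds check (eliminated by SCD)*/
            goto *jump_table[op];          /* decode + table load + indirect  */
                                           /* jump (SCD: BTB lookup keyed by  */
                                           /* the bytecode instead of the PC) */
    do_push:  stack[sp++] = *pc++;                       continue;
    do_add:   sp--; stack[sp - 1] += stack[sp];          continue;
    do_print: printf("%lld\n", (long long)stack[--sp]);  continue;
    do_halt:  return;
        }
    }

    int main(void)
    {
        const uint8_t prog[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT };
        interpret(prog);   /* prints 5 */
        return 0;
    }

In this software-only form, every iteration pays for the bounds check, the table load, and the computed goto; SCD's BTB hit path performs that work in hardware, so the common case reduces to a fetch followed by a direct jump to the handler.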