Authors: Jeime Moreno, M. Medina
Published in: [1992] Proceedings of the International Conference on Application Specific Array Processors
Publication date: 1992-08-04
DOI: 10.1109/ASAP.1992.218549 (https://doi.org/10.1109/ASAP.1992.218549)
Matrix computations in arrays of DSPs
The authors use the multimesh graph representation to map matrix algorithms onto arrays of digital signal processors (DSPs), taking the TMS320C30 as an example. Like most DSPs, this processor has a two-level memory subsystem and a built-in DMA controller. The mapping process focuses on large matrices that do not fit in the first level of memory. Good utilization of the DSP resources is achieved by executing the algorithms prism by prism, where the prisms are taken from the multimesh graph; the optimal prism size is derived. Performance estimates indicate that the DSP can be programmed so that the impact of the slower second-level memory is not significant (around 7% degradation with five wait states). Good load balancing across the array is achieved by allocating partitions of nonuniform size to the processors, as suggested in a previous publication. The proposed schedule of operations deviates from the conventional ordering, in which the inner product of two vectors is computed in full at once. Instead, it divides each inner product into portions that are executed in an interleaved manner throughout the computation of the entire problem, with each portion taking as input the partial result produced by the corresponding previous portion.
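The interleaved ordering described in the abstract can be illustrated with a small sketch of matrix multiplication. This is not the authors' multimesh or prism mapping, and it does not model the TMS320C30 memory hierarchy or DMA; it only contrasts the two orderings of the inner-product work. The function names and the portion length `kb` are hypothetical, chosen for illustration.

```python
def matmul_conventional(a, b):
    """Conventional ordering: each inner product is fully computed at once."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            acc = 0
            for k in range(m):  # whole inner product for c[i][j]
                acc += a[i][k] * b[k][j]
            c[i][j] = acc
    return c


def matmul_interleaved(a, b, kb):
    """Interleaved ordering: inner products are split into portions of length kb.

    kb is an assumed parameter standing in for the portion length; the paper
    derives an optimal prism size, which this sketch does not attempt.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    # Outer loop walks over portions of the inner-product index k.
    for k0 in range(0, m, kb):
        k1 = min(k0 + kb, m)
        # Every output element advances by one portion before any is finished.
        for i in range(n):
            for j in range(p):
                acc = c[i][j]  # partial result left by the previous portion
                for k in range(k0, k1):
                    acc += a[i][k] * b[k][j]
                c[i][j] = acc
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(matmul_conventional(a, b))
print(matmul_interleaved(a, b, kb=2))
```

Both functions perform the same multiply-accumulate operations; only the order differs. In the interleaved version, each pass touches just a `kb`-wide slice of the operands, which is the property that makes such a schedule a natural fit for a small first-level memory.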