MAMACG: a tool for automatic mapping of matrix algorithms onto mesh array computational graphs
Pub Date: 1992-08-04 · DOI: 10.1109/ASAP.1992.218548
D. Lê, M. Ercegovac, T. Lang, J. Moreno
The design of MAMACG, a software tool for automatically mapping an important class of matrix algorithms onto mesh array computational graphs, is described. MAMACG is a concrete realization of the multimesh graph (MMG) method, implemented in Elk, a dialect of LISP with built-in X-graphics capabilities.
{"title":"MAMACG: a tool for automatic mapping of matrix algorithms onto mesh array computational graphs","authors":"D. Lê, M. Ercegovac, T. Lang, J. Moreno","doi":"10.1109/ASAP.1992.218548","DOIUrl":"https://doi.org/10.1109/ASAP.1992.218548","url":null,"abstract":"The design of MAMACG, a software tool for automatically mapping an important class of matrix algorithms into mesh array computational graphs, is described. MAMACG is a concrete realization of the multimesh graph (MMG) method, implemented in Elk, a dialect of LISP with built-in X-graphics capabilities.<<ETX>>","PeriodicalId":265438,"journal":{"name":"[1992] Proceedings of the International Conference on Application Specific Array Processors","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131101895","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
High level software synthesis for signal processing systems
Pub Date: 1992-08-04 · DOI: 10.1109/ASAP.1992.218536
S. Ritz, M. Pankert, H. Meyr
For the design of complex digital signal processing systems, block-diagram-oriented simulation has become a widely accepted standard. Current research is concerned with the coupling of heterogeneous simulation engines and the transition from simulation to implementation of digital signal processing systems. Because complex design spaces are difficult to master, high-level hardware and software synthesis is becoming increasingly important. The authors concentrate on block-diagram-oriented software synthesis of digital signal processing systems for programmable processors, such as digital signal processors (DSPs). They present the synthesis environment DESCARTES, illustrating novel optimization strategies. Furthermore, they discuss goal-directed software synthesis, by which code is generated interactively or automatically and adapted to the application-specific needs imposed by constraints on memory space, sampling rate, or latency.
{"title":"High level software synthesis for signal processing systems","authors":"S. Ritz, M. Pankert, H. Meyr","doi":"10.1109/ASAP.1992.218536","DOIUrl":"https://doi.org/10.1109/ASAP.1992.218536","url":null,"abstract":"For the design of complex digital signal processing systems, block diagram oriented simulation has become a widely accepted standard. Current research is concerned with the coupling of heterogenous simulation engines and the transition from simulation to the implementation of digital signal processing systems. Due to the difficulty in mastering complex design spaces high level hardware and software synthesis is becoming increasingly important. The authors concentrate on the block diagram oriented software synthesis of digital signal processing systems for programmable processors, such as digital signal processors (DSP). They present the synthesis environment DESCARTES illustrating novel optimization strategies. Furthermore they discuss goal directed software synthesis, by which code is interactively or automatically generated, which can be adapted to the application specific needs imposed by constraints on memory space, sampling rate or latency.<<ETX>>","PeriodicalId":265438,"journal":{"name":"[1992] Proceedings of the International Conference on Application Specific Array Processors","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129916284","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scheduling partitions in systolic algorithms
Pub Date: 1992-08-04 · DOI: 10.1109/ASAP.1992.218540
A. Suarez, J. Llabería, A. Fernandez
The authors present a technique for scheduling partitions in systolic algorithms (SAs). The technique can be combined with any projection used in the design of problem-dependent-size SAs and with any spatial mapping used for the partitions. They also present the code transformations needed to convert the sequential code into the code executed in a processing element (PE) of the systolic processor (SP). The technique is applied to the matrix-by-vector problem using a non-unimodular transformation matrix and taking input and output of data into account. Cut&pile spatial mapping is used for the partitions.
{"title":"Scheduling partitions in systolic algorithms","authors":"A. Suarez, J. Llabería, A. Fernandez","doi":"10.1109/ASAP.1992.218540","DOIUrl":"https://doi.org/10.1109/ASAP.1992.218540","url":null,"abstract":"The authors present a technique for scheduling partitions in systolic algorithms (SA). This technique can be used in combination with any possible projection used for the problem dependent size SA design and with any possible spatial mapping used for the partitions. They also present the necessary code transformations to transform the sequential code into the code that is executed in a processing element (PE) of the systolic processor (SP). This technique is applied to the matrix by vector problem using a non-unimodular transformation matrix and taking into account input and output of data. Cut&pile spatial mapping is used for the partitions.<<ETX>>","PeriodicalId":265438,"journal":{"name":"[1992] Proceedings of the International Conference on Application Specific Array Processors","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130761465","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deterministic Boltzmann machine VLSI can be scaled using multi-chip modules
Pub Date: 1992-08-04 · DOI: 10.1109/ASAP.1992.218571
Michael Murray, J. Burr, D. Stork, Ming-Tak Leung, K. Boonyanit, G. Wolff, A. Peterson
The authors describe a special-purpose, very high speed, digital deterministic Boltzmann neural network VLSI chip. Each chip has 32 physical neural processors, which can be apportioned into an arbitrary topology (input, multiple hidden, and output layers) of up to 160 virtual neurons in total. Under typical conditions, the chip learns at approximately 5×10⁸ connection updates per second (CUPS). Through relatively minor (subsequent) modifications, the authors' chips can be 'tiled' in multi-chip modules to build multi-layer networks of arbitrary size, suffering only slight communication delays and overhead. In this way, the number of CUPS can be made arbitrarily large, limited only by the number of chips tiled. The chip's high speed is due to massively parallel array computation of the inner products of connection weights and neural activations, limited (but adequate) precision for weights and activations (5 bits), a high clock rate (180 MHz), and several algorithmic and design insights.
{"title":"Deterministic Boltzmann machine VLSI can be scaled using multi-chip modules","authors":"Michael Murray, J. Burr, D. Stork, Ming-Tak Leung, K. Boonyanit, G. Wolff, A. Peterson","doi":"10.1109/ASAP.1992.218571","DOIUrl":"https://doi.org/10.1109/ASAP.1992.218571","url":null,"abstract":"Describes a special purpose, very high speed, digital deterministic Boltzmann neural network VLSI chip. Each chip has 32 physical neural processors, which can be apportioned into an arbitrary topology (input, multiple hidden and output layers) of up to 160 virtual neurons total. Under typical conditions, the chip learns at approximately 5*10/sup 8/ connection updates/second (CUPS). Through relatively minor (subsequent) modifications, the authors' chips can be 'tiled' in multi-chip modules, to make multi-layer networks of arbitrary size suffering only slight communications delays and overhead. In this way, the number of CUPS can be made arbitrarily large, limited only by the number of chips tiled. The chip's high speed is due to massively parallel array computation of the inner products of connection weights and neural activations, limited (but adequate) precision for weights and activations (5 bits), high clock rate (180 MHz), as well as several algorithmic and design insights.<<ETX>>","PeriodicalId":265438,"journal":{"name":"[1992] Proceedings of the International Conference on Application Specific Array Processors","volume":"358 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133320318","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Programming systolic arrays
Pub Date: 1992-08-04 · DOI: 10.1109/ASAP.1992.218541
R. Hughey
This paper presents the New Systolic Language as a general solution to the problem of systolic programming. The language provides a simple programming interface for systolic algorithms suitable for different hardware platforms and software simulators. The New Systolic Language hides the details and potential hazards of inter-processor communication, allowing data flow only via abstract systolic data streams. Data flows and systolic cell programs for the co-processor are integrated with host functions, enabling a single file to specify a complete systolic program.
{"title":"Programming systolic arrays","authors":"R. Hughey","doi":"10.1109/ASAP.1992.218541","DOIUrl":"https://doi.org/10.1109/ASAP.1992.218541","url":null,"abstract":"This paper presents the New Systolic Language as a general solution to the problem of systolic programming. The language provides a simple programming interface for systolic algorithms suitable for different hardware platforms and software simulators. The New Systolic Language hides the details and potential hazards of inter-processor communication, allowing data flow only via abstract systolic data streams. Data flows and systolic cell programs for the co-processor are integrated with host functions, enabling a single file to specify a complete systolic program.<<ETX>>","PeriodicalId":265438,"journal":{"name":"[1992] Proceedings of the International Conference on Application Specific Array Processors","volume":"101 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124810085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hierarchical scheduling of DSP programs onto multiprocessors for maximum throughput
Pub Date: 1992-08-04 · DOI: 10.1109/ASAP.1992.218584
P. Hoang, J. Rabaey
A multiprocessor scheduling algorithm that simultaneously considers pipelining, retiming, parallel execution, and hierarchical node decomposition to maximize throughput is presented. The algorithm takes into account interprocessor communication delays as well as memory and processor availability constraints. Results on a set of benchmarks demonstrate the algorithm's ability to achieve near-optimal speedups across a wide range of applications exhibiting various types of concurrency, with good scalability with respect to processor count.
{"title":"Hierarchical scheduling of DSP programs onto multiprocessors for maximum throughput","authors":"P. Hoang, J. Rabaey","doi":"10.1109/ASAP.1992.218584","DOIUrl":"https://doi.org/10.1109/ASAP.1992.218584","url":null,"abstract":"A multiprocessor scheduling algorithm that simultaneously considers pipelining, retiming, parallel execution and hierarchical node decomposition to maximize performance throughput is presented. The algorithm is able to take into account interprocessor communication delays, and memory and processor availability constraints. The results on a set of benchmarks demonstrate the algorithm's ability to achieve near optimal speedups across a wide range of applications of various types of concurrency, with good scalability with respect to processor count.<<ETX>>","PeriodicalId":265438,"journal":{"name":"[1992] Proceedings of the International Conference on Application Specific Array Processors","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131396008","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Discrete wavelet transforms in VLSI
Pub Date: 1992-08-04 · DOI: 10.1109/ASAP.1992.218570
M. Vishwanath, R. Owens, M. J. Irwin
Three architectures, based on linear systolic arrays, for computing the discrete wavelet transform (DWT) are described. The AT² lower bound for computing the DWT in a systolic model is derived and shown to be AT² = Ω(N² N_w k). Two of the architectures are within a factor of log N of optimal, but they are of practical importance due to their regular structure, scalability, and limited I/O needs. The third architecture is optimal, but it requires complex control.
{"title":"Discrete wavelet transforms in VLSI","authors":"M. Vishwanath, R. Owens, M. J. Irwin","doi":"10.1109/ASAP.1992.218570","DOIUrl":"https://doi.org/10.1109/ASAP.1992.218570","url":null,"abstract":"Three architectures, based on linear systolic arrays, for computing the discrete wavelet transform, are described. The AT/sup 2/ lower bound for computing the DWT in a systolic model is derived and shown to be AT/sup 2/= Omega (N/sup 2/N/sub w/k). Two of the architectures are within a factor of log N from optimal, but they are of practical importance due to their regular structure, scalability and limited I/O needs. The third architecture is optimal, but it requires complex control.<<ETX>>","PeriodicalId":265438,"journal":{"name":"[1992] Proceedings of the International Conference on Application Specific Array Processors","volume":"217 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133544898","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fully static multiprocessor realization for real-time recursive DSP algorithms
Pub Date: 1992-08-04 · DOI: 10.1109/ASAP.1992.218537
Duen-Jeng Wang, Y. Hu
A systematic approach to implementing a real-time recursive digital signal processing algorithm on a dedicated multiprocessor array is presented. First, the authors unfold the algorithm so that its corresponding dependence graph becomes a newly defined generalized perfect-rate graph. They prove that the dependence graph of a recursive algorithm admits a desirable rate-optimal, fully static multiprocessor implementation if and only if it is a generalized perfect-rate graph. Based on these results, an efficient heuristic algorithm is presented that performs optimal multiprocessor scheduling and task assignment so that the number of processors required is minimized.
{"title":"Fully static multiprocessor realization for real-time recursive DSP algorithms","authors":"Duen-Jeng Wang, Y. Hu","doi":"10.1109/ASAP.1992.218537","DOIUrl":"https://doi.org/10.1109/ASAP.1992.218537","url":null,"abstract":"A systematic approach to implement a real time recursive digital signal processing algorithm on a dedicated multiprocessor array is presented. First, the authors unfold the algorithm so that its corresponding dependence graph becomes a newly defined generalized perfect rate graph. They prove that the dependence graph of a recursive algorithm admits a desirable rate optimal, full static multiprocessor implementation if and only if it is a generalized perfect rate graph. Based on these results, an efficient heuristic algorithm is presented to perform optimal multi-processor scheduling and task assignment so that the number of processors required is minimized.<<ETX>>","PeriodicalId":265438,"journal":{"name":"[1992] Proceedings of the International Conference on Application Specific Array Processors","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127238305","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Matrix computations in arrays of DSPs
Pub Date: 1992-08-04 · DOI: 10.1109/ASAP.1992.218549
Jaime Moreno, M. Medina
The authors present the use of the multimesh graph representation to map matrix algorithms onto arrays of digital signal processors (DSPs), using the TMS320C30 as an example. This processor, like most DSPs, is characterized by a two-level memory subsystem and a built-in DMA controller. The mapping process focuses on large matrices which do not fit in the first level of memory. Good utilization of the DSP resources is achieved by programming the execution of the algorithms by prisms from the multimesh graph; the optimal size of the prisms is obtained. Performance estimates indicate that it is possible to program the DSP in such a way that the impact of the slower second-level memory is not significant (around 7% degradation with five wait states). Good load balancing throughout the array is achieved by allocating partitions of the problem of nonuniform size to processors, as suggested in a previous publication. The proposed schedule of operations deviates from the conventional ordering, in which the inner product of two vectors is computed in full at once. Instead, the proposed schedule divides inner products into portions which are executed in an interleaved manner throughout the computation of the entire problem, each portion using as an input the partial result obtained from the execution of the corresponding previous portion.
{"title":"Matrix computations in arrays of DSPs","authors":"Jeime Moreno, M. Medina","doi":"10.1109/ASAP.1992.218549","DOIUrl":"https://doi.org/10.1109/ASAP.1992.218549","url":null,"abstract":"The authors present the use of the multimesh graph representation to map matrix algorithms onto arrays of digital signal processors (DSPs), using the TMS 320C30 as example. This processor, as most DSPs, is characterized by a two-level memory subsystem and a built-in DMA controller. The mapping process focuses on large matrices which do not fit in the first level of memory. Good utilization of the DSP resources is achieved by programming the execution of the algorithms by prisms from the multimesh graph; the optimal size of the prisms is obtained. Performance estimates indicate that it is possible to program the DSP in such a way that the impact of slower second-level memory is not significant (around 7% degradation with five wait states). Good load balancing throughout the array is achieved by allocating to processors partitions of the problem of nonuniform size, as suggested in a previous publication. The schedule of operations proposed deviates from the conventional ordering, wherein the inner-product among two vectors is fully computed at once. Instead, the proposed schedule divides inner-products into portions which are executed in interleaved manner throughout the computation of the entire problem, each portion using as an input the partial result obtained earlier from the execution of the corresponding previous portion.<<ETX>>","PeriodicalId":265438,"journal":{"name":"[1992] Proceedings of the International Conference on Application Specific Array Processors","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123180437","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SPERT: a VLIW/SIMD microprocessor for artificial neural network computations
Pub Date: 1992-08-04 · DOI: 10.1109/ASAP.1992.218573
K. Asanović, J. Beck, Brian E. D. Kingsbury, P. Kohn, N. Morgan, J. Wawrzynek
SPERT (synthetic perceptron testbed) is a fully programmable single-chip microprocessor designed for efficient execution of artificial neural network algorithms. The first implementation is in a 1.2 μm CMOS technology with a 50 MHz clock rate, and a prototype system is being designed to occupy a double SBus slot within a Sun SPARCstation. SPERT sustains over 300×10⁶ connections per second during pattern classification, and around 100×10⁶ connection updates per second while running the popular error backpropagation training algorithm. This represents a speedup of around two orders of magnitude over a SPARCstation 2 for algorithms of interest. An earlier system produced by the group, the Ring Array Processor (RAP), used commercial DSP chips. Compared with a RAP multiprocessor of similar performance, SPERT represents over an order of magnitude reduction in cost for problems where fixed-point arithmetic is satisfactory.
{"title":"SPERT: a VLIW/SIMD microprocessor for artificial neural network computations","authors":"K. Asanović, J. Beck, Brian E. D. Kingsbury, P. Kohn, N. Morgan, J. Wawrzynek","doi":"10.1109/ASAP.1992.218573","DOIUrl":"https://doi.org/10.1109/ASAP.1992.218573","url":null,"abstract":"SPERT (synthetic perceptron testbed) is a fully programmable single chip microprocessor designed for efficient execution of artificial neural network algorithms. The first implementation is in a 1.2 mu m CMOS technology with a 50 MHz clock rate, and a prototype system is being designed to occupy a double SBus slot within a Sun Sparcstation. SPERT sustains over 300*10/sup 6/ connections per second during pattern classification, and around 100*10/sup 6/ connection updates per second while running the popular error backpropagation training algorithm. This represents a speedup of around two orders of magnitude over a Sparcstation-2 for algorithms of interest. An earlier system produced by the group, the Ring Array Processor (RAP), used commercial DSP chips. Compared with a RAP multiprocessor of similar performance, SPERT represents over an order of magnitude reduction in cost for problems where fixed-point arithmetic is satisfactory.<<ETX>>","PeriodicalId":265438,"journal":{"name":"[1992] Proceedings of the International Conference on Application Specific Array Processors","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1992-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114314806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}