Frank E. B. Ophelders, S. Chakraborty, H. Corporaal
The heterogeneity of modern MPSoC architectures, coupled with the increasing complexity of the applications mapped onto them, has recently led to considerable interest in hybrid performance modeling techniques. Here, the idea is to apply different modeling and analysis techniques to different subsystems or components of an architecture or application. Such hybrid techniques often turn out to be more efficient and accurate than relying on a single analysis technique for the entire system. However, the challenge associated with this approach is to combine the different analysis results effectively to obtain conservative performance estimates for the entire system. In this paper we study a hybrid scheme where certain system components are simulated (e.g., using instruction set simulators), whereas others are analyzed using a formal technique called Real-Time Calculus (RTC). The main novelty of our approach stems from our use of this hybrid technique even for multiple tasks mapped onto a single processing element. In contrast, previous approaches relied on either full simulation or RTC-based analysis for an entire architectural component (e.g., a processor or a bus). The techniques we develop in this paper therefore allow for both intra- and inter-processor hybrid performance modeling and show how the different analysis results can be combined to efficiently obtain tight performance estimates for complex MPSoC architectures. We demonstrate the usefulness of this approach using an MPEG-2 decoder application that is partitioned and mapped onto two processing elements connected by FIFO buffers.
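As a rough illustration of how a simulated component can feed an RTC-style analysis, the sketch below derives an empirical upper arrival curve from a simulated event trace and compares it with a rate-latency lower service curve to estimate backlog. The function names, the rate-latency service model, and the sampled window lengths are assumptions made for illustration and are not taken from the paper. A curve derived from a single trace only covers that trace, so the result is an estimate rather than a guaranteed bound.

```python
from bisect import bisect_left


def upper_arrival_curve(timestamps, deltas):
    """Empirical upper arrival curve from a simulated event trace.

    For each window length d, returns the largest number of events that fall
    into any half-open window [t, t + d). It is enough to try windows that
    start exactly at an event.
    """
    ts = sorted(timestamps)
    curve = {}
    for d in deltas:
        best = 0
        for i, start in enumerate(ts):
            # Index of the first event at or after the end of the window.
            j = bisect_left(ts, start + d, lo=i)
            best = max(best, j - i)
        curve[d] = best
    return curve


def rate_latency_service(rate, latency):
    """Hypothetical lower service curve: beta(d) = max(0, rate * (d - latency))."""
    return lambda d: max(0.0, rate * (d - latency))


def backlog_bound(alpha_u, beta_l):
    """Largest vertical gap between the curves over the sampled window lengths."""
    return max(alpha_u[d] - beta_l(d) for d in alpha_u)
```

Calling `upper_arrival_curve` on the event timestamps produced by a simulator and passing the result to `backlog_bound` gives a buffer-size estimate at the sampled window lengths only; a finer grid of window lengths tightens the approximation.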
{"title":"Intra- and inter-processor hybrid performance modeling for MPSoC architectures","authors":"Frank E. B. Ophelders, S. Chakraborty, H. Corporaal","doi":"10.1145/1450135.1450156","DOIUrl":"https://doi.org/10.1145/1450135.1450156","journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis"},"publicationDate":"2008-10-19"}
Adaptive Body Biasing (ABB) is a widely used technique to mitigate the increasing impact of manufacturing process variations on leakage power dissipation. The efficacy of the ABB technique can be improved by partitioning a design into a number of "body-bias islands," each with its own body-bias voltage. In this paper, we propose a system-level leakage variability mitigation framework to partition a multiprocessor system into body-bias islands at the processing element (PE) granularity at design time, and to optimally assign body-bias voltages to each island post-fabrication. As opposed to prior gate- and circuit-level partitioning techniques that constrain the global clock frequency of the system, we allow each island to run at a different speed and constrain only the relevant system performance metrics, in our case the execution deadlines. Experimental results show the efficacy of the proposed framework in reducing the mean and standard deviation of leakage power dissipation compared to a baseline system without ABB. At the same time, the proposed techniques provide significant runtime improvements over a previously proposed Monte Carlo-based technique while providing similar reductions in leakage power dissipation.
{"title":"System-level mitigation of WID leakage power variability using body-bias islands","authors":"S. Garg, Diana Marculescu","doi":"10.1145/1450135.1450197","DOIUrl":"https://doi.org/10.1145/1450135.1450197","journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis"},"publicationDate":"2008-10-19"}
This paper presents a novel technique for the modeling and simulation of parallel applications for Multi-Processor Systems-on-Chip (MPSoCs). This technique consists of an application-transparent emulation of OS primitives, including task creation, scheduling, and synchronization; this emulation guarantees compatibility with any program compiled against the standard POSIX library, independently of the target OS. This methodology can be used to perform initial HW/SW partitioning and concurrent engineering of a given application, as it allows any software routine to be transparently emulated with SystemC modules. The proposed approach has been verified on a large set of multi-threaded benchmarks, with both POSIX Threads and OpenMP programming styles. Results show that our methodology enables (a) fast simulation of POSIX applications, (b) accurate analysis of multi-threaded applications, and (c) co-design and fast preliminary hardware-software partitioning.
{"title":"Concurrency emulation and analysis of parallel applications for multi-processor system-on-chip co-design","authors":"G. Beltrame, L. Fossati, D. Sciuto","doi":"10.1145/1450135.1450138","DOIUrl":"https://doi.org/10.1145/1450135.1450138","journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis"},"publicationDate":"2008-10-19"}
X. Hu, A. Khitun, K. Likharev, M. Niemier, M. Bao, Kang L. Wang
It is well recognized that novel computational models, devices and technologies are needed in order to sustain the remarkable advancement of CMOS-based VLSI circuits and systems. Regardless of the models, devices and technologies, any enhancement or replacement to CMOS must show significant gains in at least one of the key metrics (including speed, power and cost) for at least a subset of application domains currently employing CMOS circuits. In addition, effective defect-tolerant techniques are a critical factor for the successful adoption of any new computing device because nano-scale structures will have defect rates much higher than today's CMOS chips. The task of identifying application domains that could benefit the most from a new model, device or technology, and ensuring that the resultant system meets functional requirements in the presence of defects, requires synergistic efforts of physical scientists and of circuit and system design researchers. This paper contains a collection of three contributions, each focusing on one particular emergent technology, presenting a basic introduction to the technologies, some of their unique features in contrast with CMOS, potential application domains for these technologies, and new opportunities that they may bring forward in defect tolerance design. The contributions include both traditional and nontraditional state representations which use either electronic or magnetic interactions.
{"title":"Design and defect tolerance beyond CMOS","authors":"X. Hu, A. Khitun, K. Likharev, M. Niemier, M. Bao, Kang L. Wang","doi":"10.1145/1450135.1450187","DOIUrl":"https://doi.org/10.1145/1450135.1450187","journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis"},"publicationDate":"2008-10-19"}
In this paper we present a framework for a distributed and very low-cost implementation of synchronization controllers and protocols for embedded multiprocessors. The proposed architecture effectively implements queued-lock semantics in a completely distributed way. The proposed approach to synchronization implementation not only completely eliminates the overwhelming bus contention traffic when multiple cores compete for a synchronization variable, but also achieves very high energy efficiency, as the local synchronization controller can efficiently determine, without any bus transactions or local cache spinning, exactly when the lock is made available to the local processor. Application-specific information regarding synchronization variables in the local task is exploited in implementing the distributed synchronization protocol. The local synchronization controllers enable the system software or the thread library to implement various low-power policies, such as disabling cache accesses or even completely powering down the local processor while waiting for a synchronization variable.
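The queued-lock semantics mentioned above can be sketched in software as a lock that hands ownership to waiters in arrival order, so that each waiter learns of its turn from the hand-off itself instead of polling. The class below is a minimal single-threaded model written for illustration; the names `QueuedLock`, `acquire`, and `release` are assumptions, and the model does not represent the distributed hardware controllers described in the abstract.

```python
from collections import deque


class QueuedLock:
    """Toy model of queued-lock semantics: waiters are served in FIFO order."""

    def __init__(self):
        self.owner = None        # core currently holding the lock
        self.waiters = deque()   # cores waiting, in arrival order

    def acquire(self, core_id):
        """Return True if the lock is granted now, False if the core must wait."""
        if self.owner is None:
            self.owner = core_id
            return True
        self.waiters.append(core_id)
        return False

    def release(self, core_id):
        """Hand the lock to the next waiter and return its id (None if idle)."""
        if self.owner != core_id:
            raise ValueError("only the current owner may release the lock")
        self.owner = self.waiters.popleft() if self.waiters else None
        return self.owner
```

In this model a waiting core does nothing between `acquire` returning `False` and a later `release` returning its id, which is the window in which a low-power policy could idle the core.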
{"title":"Distributed and low-power synchronization architecture for embedded multiprocessors","authors":"Chenjie Yu, Peter Petrov","doi":"10.1145/1450135.1450153","DOIUrl":"https://doi.org/10.1145/1450135.1450153","journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis"},"publicationDate":"2008-10-19"}
Pub Date: 2004-09-08 DOI: 10.1109/CODES+ISSS.2004.28
Zhengting He, A. Mok
Transformative applications are a class of dataflow computation characterized by iterative behavior. The problem of partitioning a transformative application specification onto a set of available hardware (HW) and software (SW) processing elements (PEs) and deriving a job execution order (scheduling) on them has been quite well studied, but the problem of obtaining fast simulation of these applications poses different constraints. In this paper, we propose an efficient framework for a symmetric multi-processor (SMP) simulation host to achieve fast HW/SW co-simulation for transformative applications, given the partition solutions and the derived schedulers. The framework overcomes the limitations of the existing Linux SMP kernel and requires only a reasonable number of modifications to it. We also present a heuristic algorithm which effectively assigns simulation tasks to the processors on the simulation host, considering both the average job simulation time on each processor and other simulation overhead. Our experiments show that the algorithm is able to find satisfactory suboptimal solutions with very little computation time. Based on the task assignment solution, the simulation time can be reduced by 25% to 50% compared with the obvious but naive approach.
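The abstract does not spell out the heuristic, so the sketch below substitutes a generic greedy rule (longest task first, placed on the least-loaded processor) purely to illustrate what assigning simulation tasks to host processors involves. The function name `assign_tasks`, the single `overhead` term added per task, and the assumption that a task takes the same time on every processor are all hypothetical and are not the authors' algorithm.

```python
def assign_tasks(sim_time, num_procs, overhead=0.0):
    """Greedy longest-task-first assignment of simulation tasks to processors.

    sim_time maps a task name to its average simulation time; overhead is a
    hypothetical fixed cost added for every task placed on a processor.
    Returns (assignment, loads), where assignment maps task -> processor index.
    """
    loads = [0.0] * num_procs
    assignment = {}
    # Longest tasks first; ties broken by task name for reproducibility.
    for task in sorted(sim_time, key=lambda t: (-sim_time[t], t)):
        # Pick the least-loaded processor (lowest index on ties).
        proc = min(range(num_procs), key=lambda p: loads[p])
        loads[proc] += sim_time[task] + overhead
        assignment[task] = proc
    return assignment, loads
```

The largest entry in `loads` approximates the simulation time of one iteration under this simplified cost model.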
{"title":"Fast Co-Simulation of Transformative Systems with OS Support","authors":"Zhengting He, A. Mok","doi":"10.1109/CODES+ISSS.2004.28","DOIUrl":"https://doi.org/10.1109/CODES+ISSS.2004.28","journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis"},"publicationDate":"2004-09-08"}
P. Marwedel, D. Gajski, Erwin De Kock, Hugo De Man, M. Sami, I. Söderquist
The goal of this panel is to contrast existing approaches to embedded system education with the needs in industry.
{"title":"Embedded systems education: how to teach the required skills?","authors":"P. Marwedel, D. Gajski, Erwin De Kock, Hugo De Man, M. Sami, I. Söderquist","doi":"10.1145/1016720.1016781","DOIUrl":"https://doi.org/10.1145/1016720.1016781","journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis"},"publicationDate":"2004-09-08"}
This paper concerns automatic hardware synthesis from a data flow graph (DFG) specification for fast HW/SW cosynthesis. A node in the DFG represents a coarse-grained block such as FIR or DCT, and a port in a block may consume multiple data samples per invocation, which distinguishes our approach from behavioral synthesis and complicates the problem. In the presented design methodology, a dataflow graph with a specified algorithm can be mapped to various hardware structures according to the resource allocation and schedule information. This simplifies the management of the area/performance tradeoff in hardware design and widens the design space of hardware implementations of a dataflow graph compared with previous approaches. Experiments with several examples demonstrate the usefulness of the proposed technique.
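When ports produce or consume several samples per invocation, a schedule first needs to know how often each block fires per iteration. A common way to obtain this for multirate dataflow graphs is to solve the balance equations, one per edge: firings of the source times samples produced equals firings of the destination times samples consumed. The sketch below does this for a connected graph; the function name, the edge tuple layout, and the example rates are assumptions for illustration and are not taken from the paper.

```python
from fractions import Fraction
from math import lcm


def repetition_vector(edges):
    """Smallest positive integer firing counts satisfying the balance equations.

    edges is a list of (src, produced, dst, consumed) tuples. For every edge,
    firings[src] * produced == firings[dst] * consumed must hold.
    Assumes the graph is connected.
    """
    neighbours = {}
    for src, prod, dst, cons in edges:
        neighbours.setdefault(src, []).append((dst, Fraction(prod, cons)))
        neighbours.setdefault(dst, []).append((src, Fraction(cons, prod)))

    start = edges[0][0]
    rate = {start: Fraction(1)}
    stack = [start]
    while stack:
        node = stack.pop()
        for other, ratio in neighbours[node]:
            expected = rate[node] * ratio
            if other not in rate:
                rate[other] = expected
                stack.append(other)
            elif rate[other] != expected:
                raise ValueError("inconsistent sample rates in the graph")

    # Scale the rational rates to integers.
    scale = lcm(*(r.denominator for r in rate.values()))
    return {node: int(r * scale) for node, r in rate.items()}
```

For a hypothetical chain in which a source emits 2 samples per firing into an FIR block that reads 3, and the FIR block emits 1 sample into a DCT block that reads 2, the function returns 3, 2, and 1 firings respectively.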
{"title":"Hardware synthesis from coarse-grained dataflow specification for fast HW/SW cosynthesis","authors":"Hyunuk Jung, S. Ha","doi":"10.1145/1016720.1016730","DOIUrl":"https://doi.org/10.1145/1016720.1016730","journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis"},"publicationDate":"2004-09-08"}
Pub Date: 2004-09-08 DOI: 10.1109/CODES+ISSS.2004.10
S. Mattisson
Summary form only given. In ten years the cellular telephone has evolved from a tool for the professional to an indispensable consumer product with a very high market penetration. At the same time, the handset cost, weight, and standby time have been reduced by more than a factor of ten. These factors have been critical for the success story of the mobile phone. The technical aspects behind the rapid handset evolution are discussed, in particular what advances in the radio architecture (for example, the zero-IF GSM receiver), the baseband (CMOS) technology, and radio system design have meant for the reduction of size, weight, cost, and power consumption. Future challenges, such as SW-DSP-digital-RF partitioning, linear multi-mode modulation with high linearity requirements, digital leakage issues, and power consumption limitations in multimedia handsets, are discussed with future-generation handsets in mind.
{"title":"Cellular Handset Technology System Requirements and Integration Trends","authors":"S. Mattisson","doi":"10.1109/CODES+ISSS.2004.10","DOIUrl":"https://doi.org/10.1109/CODES+ISSS.2004.10","journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis"},"publicationDate":"2004-09-08"}
The growing gap between transistor and global wire speeds in sub-100 nanometer technologies poses numerous challenges to computer architects and circuit designers. This challenge looks to be even more significant in far-future technologies such as molecular-scale wire transmission, whether using carbon nanotubes or quantum dots. While a fixed design scales as its area decreases with feature size reductions, future designs that use a constant area see rapidly increasing global latencies. Two approaches to address these latencies are (1) to use signaling and design techniques to reduce the actual latencies, and (2) to use architectural innovations to reduce the distance that signals must be propagated in the common case. In this talk, after an overview of the communication latency issue, I describe current research that aims to reduce the average distance communicated for processing and memory system signals. For processor designs, I will describe the Static Placement, Dynamic Issue (SPDI) execution model, which allows the compiler to place dependent instructions near one another, and which is being implemented in the TRIPS processor. I will also describe Non-Uniform Cache Access (NUCA) designs, which attempt to reduce average signal distance for cache accesses.
{"title":"Architectural versus physical solutions for on-chip communication challenges","authors":"D. Burger","doi":"10.1145/944645.944665","DOIUrl":"https://doi.org/10.1145/944645.944665","journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis"},"publicationDate":"2003-10-01"}