Reohr, Navarro, Chan, Mayo, Curran, Krumm, M M Pelella, Lu, Bakhru, Kowalczyk, Rawlins, Carey, Wu
Title: Precharged Cache Hit Logic With Flexible Timing Control
Published in: Symposium 1997 on VLSI Circuits, 1997-06-12
DOI: 10.1109/VLSIC.1997.623793
Citations: 3
Abstract
Introduction: The design of fast and robust Cache hit logic was one of the fundamental hurdles overcome to achieve the reported 350 MHz for an S/390 microprocessor in a 0.2 µm Leff technology [1]. The 4-way set-associative Cache mechanism is shown in Figure 1. It consists of the Cache memory, which holds a subset of the machine instructions and data; the Directory memory, which holds a subset of absolute addresses (main memory addresses) corresponding to the Cache’s instruction and data entries; the absolute Translation Lookaside Buffer (TLB), which holds recently translated absolute address entries; and the logical TLB, which holds the virtual addresses (internally generated machine addresses) corresponding to the absolute TLB’s address entries. The hit logic, which links all these memories together, resolves each cycle whether the Cache memory holds the instructions or data requested by the processor. If the Cache holds the correct entry, the hit logic selects the appropriate entry from the 4 possible sets simultaneously read out of the Cache. While most of the processor logic was implemented in static CMOS, performance requirements dictated that the hit logic employ precharge techniques to achieve a single-cycle directory lookup and set selection. This in turn raised a number of circuit issues related to the proper interfacing of static and precharged logic families. Flexible timing control was incorporated around these interfaces, where races exist, to obtain functional hardware at all process and test corners.

Timing Diagram: Figure 2 shows the timing diagram for the logic and memory circuits. The falling edge of the Global Clock triggers both the capture and launch of signals through the logic registers, begins the precharge of the hit logic (Hit L. precharge), and starts the internal decode of addresses in the memory macros.
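As a functional reference for the lookup described in the Introduction, the following Python sketch models what the hit logic decides each cycle. It is only a behavioral model under stated assumptions: the function name `hit_select`, the integer address encoding, and the omission of valid bits and of all timing are illustrative choices, not details from the paper.

```python
from typing import List, Optional, Tuple

WAYS = 4  # the abstract describes a 4-way set-associative Cache


def hit_select(tlb_absolute: int,
               directory_entries: List[int],
               cache_entries: List[bytes]) -> Tuple[bool, Optional[bytes]]:
    """Behavioral model only (hypothetical helper, not the paper's circuit).

    Compares the translated absolute address against the Directory entry
    of each way and selects the matching Cache entry. Valid bits and
    timing are deliberately ignored.
    """
    if len(directory_entries) != WAYS or len(cache_entries) != WAYS:
        raise ValueError("expected one Directory and one Cache entry per way")
    for way in range(WAYS):
        # A bit-by-bit XOR that is all zeros means every address bit matched.
        if (directory_entries[way] ^ tlb_absolute) == 0:
            return True, cache_entries[way]
    return False, None
```

In hardware the four comparisons run in parallel on entries read out simultaneously; the loop here only expresses the same selection sequentially.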
About halfway through the cycle, the TLB and Directory memories send their outputs to the comparator logic, which in turn drives the Anding and Continuation logic. The hit signal produced by this sequence finally selects one of four possible cache entries to be driven out to the Cache Output Register. All memory accesses and logic evaluations happen in a single processor cycle. Precharge occurs when circuits are not evaluating. Each functional block in the memory circuits generates its own precharge using a self-resetting scheme [2] (SRCMOS). The Directory and TLB memories produce a wide enough output pulse, approximately a third of the cycle, to guarantee sufficient overlap between their signals so that the bit “Anding” done in the XOR circuits, a part of the Directory Comparator, functions reliably. Once the hit logic topples, data stay latched until the next cycle’s precharge. The single-phase clocking scheme described is prone to short-path timing problems introduced through the precharge of the hit logic. Clock skew may cause the hit logic to precharge before a latch capturing the hit logic’s state has time to close. Padding prevented this situation from occurring.

Strobed Compare Equal Circuit: Figure 4 shows the Directory comparator circuit, consisting of a bit-by-bit XOR followed by a strobed “Anding” plane. The primary complication of this comparator is that it requires a timing circuit to assert the strobe only after the XOR circuits have had a chance to detect a difference between the TLB and Directory signals and, if a single bit miscompares, trigger a NOR transistor to pull down dyn1. A race condition exists between data arriving and strobe assertion. Activating the strobe too early causes the comparator to fail functionally with a constant signature of a high Compare-Equal, while activating the strobe too late adds dead time to the circuit delay.
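The race just described can be illustrated with a small timing model. This is a simplified sketch, not the circuit itself: it assumes the precharged node dyn1 is either fully high or fully low at the instant of the strobe, and the names `strobed_compare`, `data_ready_ps`, and `strobe_ps` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StrobeResult:
    compare_equal: bool   # value the comparator reports
    correct: bool         # True if that value matches the real comparison
    dead_time_ps: float   # time the strobe waited beyond data arrival


def strobed_compare(dir_bits: int, tlb_bits: int,
                    data_ready_ps: float, strobe_ps: float) -> StrobeResult:
    """Simplified model of a strobed compare (assumed behavior).

    dyn1 is precharged high and is pulled low only after the data has
    arrived and at least one bit miscompares. Compare-Equal goes high if
    dyn1 is still high when the strobe fires.
    """
    miscompare = (dir_bits ^ tlb_bits) != 0
    dyn1_pulled_low = miscompare and strobe_ps >= data_ready_ps
    compare_equal = not dyn1_pulled_low
    correct = compare_equal == (not miscompare)
    dead_time_ps = max(0.0, strobe_ps - data_ready_ps)
    return StrobeResult(compare_equal, correct, dead_time_ps)
```

With a miscompare and a strobe that fires before `data_ready_ps`, the model reports a high Compare-Equal that is wrong, which is the failure signature named above; a strobe that fires late is always correct but accumulates dead time.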
Note also that all precharge circuits (Figures 4 and 5) use keeper PFETs (PN1, 2, 3) to manage charge sharing and leakage on dynamic nodes, and skewed static inverters (not shown) to remove coupling noise introduced on long signal lines. In Figure 3, E-beam results for the Strobed Compare Equal Circuit were obtained from a test site with visible circuit nodes and precise strobe timing control. The first waveforms show the 200 ps circuit performance from the pulsed input dir-T rising to the output Compare-Equal rising. The second waveforms show that, to achieve that performance, the Strobe is triggered before node dyn1 is pulled down. The third waveform shows how the overly aggressive Strobe setting produces a noise glitch in the miscompare case, which indicates that dyn2 is partially discharged through transistors N1, N2, and NSTROBE. A half latch, transistor PN1, recovers node dyn2 high once transistor N1 shuts off. In a real design, such an aggressive strobe setting is not advised, since mistracking between the strobe path and the data path can be introduced by long signal wires of varying length, power supply bounce, transistor mistracking, and coupling noise. A strobe timing circuit is shown attached to the comparator in Figure 4. The Strobe signal is developed by “Anding” (AND1) both the TLB and Directory signals. In this way, the slowest memory is guaranteed to trigger the comparator strobe. An “OR” (OR1 & 2) of a single bit’s true and complement signals (dir-T and dir-C, tlb-T and tlb-C) determines when a memory launches its data to the comparator since, by design, either the memory’s true or complement signal will go high. The strobe path replicates the comparator path all the way to the XOR input. In an early test version, differences between a memory’s write-through (i.e., the cell is written and read simultaneously) and read access times caused the strobe to trigger before all the comparator data arrived.
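The strobe generation described above (an OR per memory, then an AND across the two memories) can be written out as a short logic sketch. It assumes that both rails of a timing bit are low until the memory launches its data, and the function names and the `gate_delay_ps` parameter are illustrative, not taken from the paper.

```python
def memory_launched(true_rail: bool, complement_rail: bool) -> bool:
    """OR1 / OR2: a memory has launched once either rail of its timing bit is high."""
    return true_rail or complement_rail


def strobe(dir_t: bool, dir_c: bool, tlb_t: bool, tlb_c: bool) -> bool:
    """AND1: the strobe asserts only when both memories have launched."""
    return memory_launched(dir_t, dir_c) and memory_launched(tlb_t, tlb_c)


def strobe_time_ps(dir_launch_ps: float, tlb_launch_ps: float,
                   gate_delay_ps: float = 0.0) -> float:
    """Because of the AND, the later (slower) memory sets the strobe time.

    gate_delay_ps is a hypothetical lumped delay for the OR and AND gates.
    """
    return max(dir_launch_ps, tlb_launch_ps) + gate_delay_ps
```

The `max` in `strobe_time_ps` is the timing counterpart of the AND: whichever memory launches last is the one that triggers the strobe.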
Each strobe circuit drove two comparators, each operating on separate Directory address data. In certain cases, one data field was written through while the other was read. The problem was that a fast write-through in the dedicated timing bits of Figure 4 caused a “fast” strobe to trigger a comparator operating on “slow” read data too early. In the miscompare case, the comparator failed with a high Compare-Equal when a low Compare-Equal was expected. Fortunately, strobing had been considered enough of a risk to add a timing control to the strobe circuit. Delaying the strobe cured the functional problem on early test hardware. The final fix included more margin in the strobe and added a dedicated strobe circuit to each comparator. A stress mode now tightens the strobe timing during final test to ensure that no marginally good hardware escapes. An explanation of how a write-through can occur earlier than a read follows. Writing data directly onto a memory’s bit line pair, by pulling down one bit line, can trigger an inverter driving signals out of a sense amplifier without any assistance from the differential amplifier. If the signal swing into the sense amplifier is large enough, which it is in the write-through case, the inverter output goes high once the signal falls below the inverter’s switching threshold. Recall that the only function of a sense amplifier in an SRAM is to rapidly amplify a slowly developing bit line difference produced by reading a memory cell. In the strobed comparator, flexible timing control proved indispensable in fixing an unanticipated failing race. In general, timing control facilities may be incorporated into a design to stress or relax races during testing and burn-in [3]. Timing control consists of a variable delay element and scannable latches. Scannable latches hold the desired timing mode. They are set before the global clock is asserted.
Small variable delay elements are constructed by selectively adding or removing transistor width to an inverter in a timing
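The timing-control idea described above, scannable latches selecting a variable strobe delay, can be sketched as a small behavioral model. Everything numeric here is hypothetical: the three mode names, the delay values in `STROBE_DELAY_PS`, and the arrival times used in the usage note are illustrative and do not come from the paper.

```python
from enum import Enum


class TimingMode(Enum):
    STRESS = "stress"     # tightened strobe, e.g. for final test
    NOMINAL = "nominal"
    RELAXED = "relaxed"   # extra margin for the race


# Hypothetical strobe delays per mode, in picoseconds (not from the paper).
STROBE_DELAY_PS = {
    TimingMode.STRESS: 20.0,
    TimingMode.NOMINAL: 40.0,
    TimingMode.RELAXED: 80.0,
}


class ScanLatches:
    """Holds the timing mode; it may only be changed before the clock starts."""

    def __init__(self) -> None:
        self.mode = TimingMode.NOMINAL
        self.clock_running = False

    def scan_in(self, mode: TimingMode) -> None:
        if self.clock_running:
            raise RuntimeError("timing mode must be set before the clock starts")
        self.mode = mode

    def start_clock(self) -> None:
        self.clock_running = True


def strobe_arrival_ps(timing_bit_ps: float, latches: ScanLatches) -> float:
    """Strobe time = timing-bit arrival plus the delay selected by the latches."""
    return timing_bit_ps + STROBE_DELAY_PS[latches.mode]


def race_won(data_arrival_ps: float, strobe_ps: float) -> bool:
    """The comparator is safe only if the strobe does not precede the data."""
    return strobe_ps >= data_arrival_ps
```

As a usage illustration with made-up numbers, take a written-through timing bit arriving at 300 ps and read data arriving at 350 ps. In the nominal mode the strobe fires at 340 ps and loses the race, while scanning in the relaxed mode moves it to 380 ps and the comparison becomes safe. The stress mode moves the strobe the other way, which is how marginal hardware could be exposed during test.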