Transcript of HOTCHIPS VI presentation of the 21164 microprocessor
Key attributes:
new design (not a derivative step like 21064 -> 21064A)
4-way issue superscalar
Large on-chip L2 cache
7-stage integer pipeline
9-stage floating point pipeline
low latencies at high clock rate
high-throughput memory subsystem
Other properties:
40b physical address (1 Terabyte)
43b virtual address (8 Terabytes)
128b external cache interface
L3 cache controller integrated
Instruction translation buffer 48 entries
Data translation buffer 64 entries
16.5 mm x 18.1 mm die size (slightly smaller than original Pentium)
0.5 micron, 4-layer metal CMOS5 process
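The address-space sizes quoted above follow directly from the address widths. A minimal sketch of that arithmetic, assuming binary units (1 Terabyte = 2^40 bytes); the function and constant names are illustrative only:

```python
# Address widths taken from the slide; binary units (1 TB = 2**40 bytes).
PHYSICAL_ADDRESS_BITS = 40
VIRTUAL_ADDRESS_BITS = 43
TERABYTE = 2 ** 40


def addressable_terabytes(bits: int) -> int:
    """Return how many terabytes can be addressed with `bits` address bits."""
    return (2 ** bits) // TERABYTE


physical_tb = addressable_terabytes(PHYSICAL_ADDRESS_BITS)
virtual_tb = addressable_terabytes(VIRTUAL_ADDRESS_BITS)
print(f"physical: {physical_tb} TB, virtual: {virtual_tb} TB")
```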
Execution pipelines:
Integer Pipeline 0: arith, logical, ld/st, shift
Integer Pipeline 1: arith, logical, ld, br/jmp, int mul
FP Pipeline 0: add, subtract, compare, FP branch
FP Pipeline 1: multiply
FP div hangs off FP pipe 0, but runs independently
Latencies (cycles):
Most int ops 1
CMOV 2
Int mul 8 - 16
Float ops 4
loads (L1 cache hit) 2
compare or logical op to CMOV or conditional BR 0
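To see how these latencies add up along a chain of dependent instructions, here is a rough sketch. It is a simplified model, not a description of the actual issue logic: it ignores issue width and structural hazards, takes the lower bound (8) for integer multiply, and the operation names and helper functions are hypothetical.

```python
# Producer latencies in cycles, from the slide. Integer multiply is listed
# as a range (8 - 16); the lower bound is used here as an assumption.
LATENCY = {"int": 1, "cmp": 1, "logical": 1, "cmov": 2,
           "mul": 8, "fp": 4, "load": 2}

# Special case from the slide: a compare or logical op feeding a CMOV or a
# conditional branch has latency 0.
ZERO_LATENCY_PRODUCERS = {"cmp", "logical"}
ZERO_LATENCY_CONSUMERS = {"cmov", "br"}


def edge_latency(producer, consumer):
    """Cycles between issuing `producer` and issuing a dependent `consumer`."""
    if producer in ZERO_LATENCY_PRODUCERS and consumer in ZERO_LATENCY_CONSUMERS:
        return 0
    return LATENCY[producer]


def issue_cycles(chain):
    """Earliest issue cycle of each op in a strictly dependent chain.

    A branch ("br") may only appear as the last op, since it produces
    no register result in this model.
    """
    cycles = [0]
    for producer, consumer in zip(chain, chain[1:]):
        cycles.append(cycles[-1] + edge_latency(producer, consumer))
    return cycles


# load -> add -> compare -> conditional branch
example = issue_cycles(["load", "int", "cmp", "br"])
print(example)
```

In this model the compare and the branch that depends on it issue in the same cycle, which is what the 0-cycle entry expresses.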
On-chip data caches:
dual-ported L1 data cache (8Kbyte, write through, non-blocking)
On-Chip L2 cache (96Kbyte, 3-way set assoc., write back, pipelined)
Miss Address File (MAF), 6 entry, between L1 and L2
MAF merges loads to the same cache block
Up to 21 loads; multiple loads merge regardless of order
Up to two register file fills per cycle
Bus Address File (BAF), 2 entry, between L2 and external memory
L3 cache (off-chip)
Direct-mapped write-back superset of L2 cache
Up to 2 outstanding reads
Programmable wave pipelining
L3 cache is optional
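The load-merging behaviour of the MAF described above can be sketched as a small model. This is an illustration only: the entry count (6), the load limit (21) and the 32-byte L1 block size come from the slides, but the class, its method names, and the exact stall policy are assumptions, not the hardware's actual design.

```python
BLOCK_SIZE = 32    # L1 cache block size in bytes (from the slides)
MAX_ENTRIES = 6    # MAF entries (from the slides)
MAX_LOADS = 21     # total loads that can be merged (from the slides)


class MissAddressFile:
    """Toy model: one entry per missing cache block, loads merged per block."""

    def __init__(self):
        self.entries = {}  # block number -> list of destination registers

    def pending_loads(self):
        return sum(len(regs) for regs in self.entries.values())

    def record_miss(self, address, dest_reg):
        """Return 'merged', 'allocated', or 'stall' (stall policy is assumed)."""
        if self.pending_loads() >= MAX_LOADS:
            return "stall"
        block = address // BLOCK_SIZE
        if block in self.entries:
            # Same cache block as an earlier miss: merge, whatever the order.
            self.entries[block].append(dest_reg)
            return "merged"
        if len(self.entries) < MAX_ENTRIES:
            self.entries[block] = [dest_reg]
            return "allocated"
        return "stall"

    def fill(self, address):
        """Block arrives from L2: free the entry, return waiting registers."""
        return self.entries.pop(address // BLOCK_SIZE, [])


maf = MissAddressFile()
addresses = [0x1000, 0x1008, 0x2000, 0x1010]
results = [maf.record_miss(addr, reg) for reg, addr in enumerate(addresses)]
woken = maf.fill(0x1000)
print(results, woken)
```

The fourth load merges into the first entry even though a miss to a different block was recorded in between, which is the "regardless of order" property.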
Instruction prefetching
Aggressive prefetching from L2 cache, at least three 32-byte blocks ahead of the current issue point
Continuous integer instruction issue out of L2 cache (2 per cycle)
60% of peak issue rate possible out of L2 cache (2.4 per cycle)
Latency and bandwidth of memory operations
level | latency (cycles) | bandwidth (bytes/cycle) |
L1 | 2 | 16 |
L2 | 8 | 16 |
L3 | >= 12 | <= 4 |
L1 cache block size 32 bytes
L2, L3 cache block sizes 64 bytes (with 32-byte block size option)
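As a back-of-the-envelope use of these figures, the sketch below estimates block transfer times and an average load latency. Several things here are assumptions rather than statements from the presentation: the hit rates are purely hypothetical, every L2 miss is assumed to hit in L3, the L3 numbers are taken at their bounds (12 cycles, 4 bytes/cycle) so the results are best cases, and a block is assumed to stream at the peak bandwidth of the level supplying it.

```python
# Values from the table above. The L3 figures are bounds (>= 12 cycles,
# <= 4 bytes/cycle), so anything involving L3 is a best-case number.
LATENCY_CYCLES = {"L1": 2, "L2": 8, "L3": 12}
BANDWIDTH_BYTES_PER_CYCLE = {"L1": 16, "L2": 16, "L3": 4}


def transfer_cycles(block_bytes, level):
    """Cycles to stream one block at the given level's peak bandwidth."""
    return block_bytes // BANDWIDTH_BYTES_PER_CYCLE[level]


def average_load_latency(l1_hit_rate, l2_hit_rate):
    """Average load latency in cycles, assuming every L2 miss hits in L3."""
    l1_miss = 1.0 - l1_hit_rate
    l2_miss = 1.0 - l2_hit_rate
    return (l1_hit_rate * LATENCY_CYCLES["L1"]
            + l1_miss * l2_hit_rate * LATENCY_CYCLES["L2"]
            + l1_miss * l2_miss * LATENCY_CYCLES["L3"])


l1_fill = transfer_cycles(32, "L2")  # 32-byte L1 block supplied by L2
l2_fill = transfer_cycles(64, "L3")  # 64-byte L2 block supplied by L3
example = average_load_latency(0.90, 0.80)  # hypothetical hit rates
print(l1_fill, l2_fill, round(example, 2))
```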
Cycle count improvements over the 21064/21064A
operation | 21164 | 21064/21064A |
shifts/byte ops | 1 | 2 |
int mul | 8-16 | 19-23 |
cmp->branch | 0 | 1 |
float ops | 4 | 6 |
L1 data cache | 2 | 3 |
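The table can be turned into cycles saved per operation with a few lines of code. This is only a sketch: for integer multiply, pairing the low end of one range with the low end of the other (and high with high) is an assumption, and the figures are cycle counts, not wall-clock times, since the clock rates of the chips differ.

```python
# (21164 cycles, 21064/21064A cycles), from the table above.
# Ranges are stored as (low, high); single values repeat the same number.
CYCLES = {
    "shifts/byte ops": ((1, 1), (2, 2)),
    "int mul": ((8, 16), (19, 23)),
    "cmp->branch": ((0, 0), (1, 1)),
    "float ops": ((4, 4), (6, 6)),
    "L1 data cache": ((2, 2), (3, 3)),
}


def cycles_saved(new, old):
    """Cycles saved at the low and at the high end of each range."""
    return (old[0] - new[0], old[1] - new[1])


saved = {op: cycles_saved(new, old) for op, (new, old) in CYCLES.items()}
for op, (at_low, at_high) in saved.items():
    fewest, most = min(at_low, at_high), max(at_low, at_high)
    if fewest == most:
        print(f"{op}: saves {fewest} cycle(s)")
    else:
        print(f"{op}: saves {fewest} to {most} cycles")
```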