INTEL P6 UNVEILED AT ISSCC
San Francisco, CA -- February 16, 1995 -- Intel's presentation at the
ISSCC conference today confirmed some of the early predictions about the
microarchitecture, and presented some vital statistics for their current
0.6um implementation. P6 will be a two-chip implementation, pairing
the P6 CPU with an L2 cache so that the bus between the two runs at the full
CPU clock speed.
The processor issues x86 instructions in order, generating "uops", which
are the atomic RISC-like operations supported by the underlying
hardware. Up to 3 uops can be decoded per cycle, with some restrictions
on the types that can be decoded simultaneously. A micro-instruction
sequencer attached to the decoder provides the mapping from x86 instructions
to uops. The uops are then passed to a single 20-entry
reservation station, which decouples the in-order issue
from the out-of-order execution unit. When a uop is completed, it is placed
into the 40-entry reorder buffer. Retirement is in order, allowing for
precise interrupts and exceptions.
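The in-order completion (retirement) described above can be illustrated with a toy model. This is only a sketch, not Intel's design: the function name `retire_in_order` is hypothetical, and it assumes every in-flight uop has a reorder-buffer slot reserved in program order, which the presentation summary does not state. Only the 40-entry size comes from the article.

```python
from collections import deque

ROB_ENTRIES = 40  # reorder-buffer size quoted in the article


def retire_in_order(program_order, completion_order, rob_entries=ROB_ENTRIES):
    """Return the order in which uops retire in this toy model.

    program_order:    uop ids in the order they were issued.
    completion_order: the same ids in the order execution finished.
    """
    if len(program_order) > rob_entries:
        raise ValueError("toy model: more uops in flight than ROB entries")
    rob = deque(program_order)  # oldest uop sits at the head (assumption)
    done = set()
    retired = []
    for uop in completion_order:
        done.add(uop)  # uop has finished executing, possibly out of order
        # Retire from the head only while the oldest uop has completed.
        while rob and rob[0] in done:
            retired.append(rob.popleft())
    return retired


# Execution finishes out of order, yet retirement follows program order.
print(retire_in_order(["a", "b", "c", "d"], ["c", "a", "d", "b"]))
# prints ['a', 'b', 'c', 'd']
```

Because nothing younger than the oldest unfinished uop can retire, an exception on that uop can be taken with all older work committed and all younger work discarded, which is what makes the interrupt precise.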
The instruction fetch and decode unit is 8 pipeline stages deep. It also
includes a 512-entry branch-target buffer. Of those 8 stages, 2.5 are
dedicated to the instruction-cache access alone, with another 2.5 stages
used for decoding the x86 instruction. The out-of-order core is
3 pipeline stages deep for simple one-cycle execute instructions; 2 of those
cycles are required to set up the reservation station accesses. The floating point
execution pipelines are deeper. Lastly, the reorder buffer and retirement
unit account for another 3 pipeline stages. The minimum number of cycles required to
complete an instruction is therefore 14 (8 + 3 + 3).
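The stage counts above add up to the quoted minimum, as the short calculation below shows. The conversion to nanoseconds is an added illustration, assuming one pipeline stage per clock cycle at the 133 MHz clock from the statistics table and no stalls; the constant names are made up for this sketch.

```python
FETCH_DECODE_STAGES = 8  # includes 2.5 for I-cache access and 2.5 for x86 decode
OUT_OF_ORDER_STAGES = 3  # simple one-cycle execute instructions
RETIRE_STAGES = 3        # reorder buffer and retirement unit
CLOCK_MHZ = 133          # from the statistics table

# Minimum cycles for one instruction to pass through the whole pipeline.
min_cycles = FETCH_DECODE_STAGES + OUT_OF_ORDER_STAGES + RETIRE_STAGES

# Assumption: one stage per clock cycle, so latency = cycles * cycle time.
min_latency_ns = min_cycles * 1000 / CLOCK_MHZ

print(min_cycles)                # prints 14
print(round(min_latency_ns, 1))  # prints 105.3
```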
Intel is still fabricating with BiCMOS technology, and claimed that using
BiNMOS drivers for large-fanout gates provided a 15% performance increase.
Chip Vital Statistics
---------------------
Performance: | 200 SPECint92 (estimated). No SPECfp given. |
Clock: | 133 MHz |
Process: | 4 metal 0.6um BiCMOS. |
Vdd: | 2.9 V |
Power: | 14 Watts (estimated) |
Transistors: | 5.5 million |
Package: | Dual-cavity PGA. |
L1 Cache: | 8 kB I and D caches. Dual-ported D-cache supports one load and one store per cycle. |
L2 Cache: | 256 kB, Unified. |
External Bus: | 64-bit, can operate at 1/2, 1/3, or 1/4 of the CPU clock speed, with one data transfer per cycle. |
Superscalar: | In-order issue, out-of-order execution, in-order retire. Supports speculative execution. Peak issue is 3 uops ( <= 3 x86 instructions ). |
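The External Bus row implies a peak transfer rate for each clock divisor. Intel's presentation as reported here gives no bandwidth figure, so the sketch below is a derived estimate that assumes the 133 MHz CPU clock and counts 1 MB as 10**6 bytes. The function name `peak_bus_bandwidth_mb_s` is hypothetical.

```python
CPU_CLOCK_MHZ = 133   # from the statistics table
BUS_WIDTH_BITS = 64   # from the statistics table


def peak_bus_bandwidth_mb_s(divisor, cpu_clock_mhz=CPU_CLOCK_MHZ,
                            width_bits=BUS_WIDTH_BITS):
    """Peak MB/s with one data transfer per bus cycle (1 MB = 10**6 bytes)."""
    bus_clock_mhz = cpu_clock_mhz / divisor
    return bus_clock_mhz * width_bits / 8


for divisor in (2, 3, 4):
    print(divisor, round(peak_bus_bandwidth_mb_s(divisor), 1))
# prints:
# 2 532.0
# 3 354.7
# 4 266.0
```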