Section Four: Unix and RISC, a New Hope
The basic design is scalable, from 32 to 48 and 64 bit designs, with 16 general purpose registers. It is a memory-data instruction set, but an elegant one. One early design was the Mitsubishi M32 (mid 1987), which optimised the simple and often used TRON instructions, much like the 80486 and 68040 did. It featured a 5 stage pipeline and dynamic branch prediction with a branch target buffer similar to that in the AMD 29K. It also featured an instruction prefetch queue, but being a prototype, had no MMU support or FPU.
Commercial versions such as the Gmicro/200 (1988) and other Gmicro/ from Fujitsu/Hitachi/Mitsubishi, and the Toshiba Tx1 were also introduced, and a 64 bit version (CHIP64) began development, but they didn't catch on in the non-Japanese market (definitive specifications or descriptions of the OS's actual operation were hard to come by, while research systems like Mach or BSD Unix were widely available for experimentation). In addition, newer techniques (such as load-store designs) overshadowed the TRON standard. Companies such as Hitachi switched to load-store designs, and many American companies (Sun, MIPS) licensed their (faster) designs openly to Japanese companies. TRON's promise of a unified architecture (when complete) was less important to companies than raw performance and immediate compatibility (Unix, MS-DOS/MS Windows, Macintosh), and has not become significant in the industry, though TRON operating system development continued as an embedded and distributed operating system (such as the Intelligent House project, or more recently the TiPO handheld digital assistant from Seiko (February 1997)) implemented on non-TRON CPUs.
NEC produced a similar memory-data design around the same time, the V60/V70 series, using thirty two registers, a seven stage pipeline, and preprocessed branches. NEC later developed the 32-bit load-store V800 series, and became a source of 64-bit MIPS load-store processors.
68000-based CPUs and a standard operating system, Unix. Research versions of load-store processors had promised a major step forward in speed [See Appendix A], but existing manufacturers were slow to introduce a RISC processor, so Sun went ahead and developed its own (based on Berkeley's design). In keeping with their open philosophy, they licensed it to other companies, rather than manufacture it themselves.
SPARC was not the first RISC processor. The AMD 29000 (see below) came before it, as did the MIPS R2000 (based on Stanford's experimental design) and Hewlett-Packard PA-RISC CPU, among others. The SPARC design was radical at the time, even omitting multiple cycle multiply and divide instructions (added in later versions), using repeated single-cycle "step" instructions instead (similar in idea to the square root step instruction in the Transputer T-800), while most RISC CPUs were more conventional.
SPARC usually contains about 128 or 144 integer registers (memory-data designs typically had 16 or fewer). At any time 32 registers are available - 8 are global, the rest are allocated in a 'window' from a stack of registers. The window is moved 16 registers down the stack during a function call, so that the upper and lower 8 registers are shared between functions, to pass and return values, and 8 are local. The window is moved up on return, so registers are loaded or saved only at the top or bottom of the register stack. This allows functions to be called in as little as 1 cycle. Later versions added an FPU with thirty-two (non-windowed) registers. Like most RISC processors, global register zero is wired to zero to simplify instructions, and SPARC is pipelined for performance (a new instruction can start execution before a previous one has finished), but not as deeply as others - like the MIPS CPUs, it has branch delay slots. Also like previous processors, a dedicated condition code register (CCR) holds comparison results.
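The window overlap described above can be sketched as a mapping from the 32 visible register numbers to slots in a circular register file. This is only an illustrative model: the function names and the choice of eight windows are assumptions, overflow and underflow (the loads and saves at the ends of the stack) are not modelled, and the numbering of outs (r8-r15), locals (r16-r23) and ins (r24-r31) follows the usual SPARC convention.

```python
NWINDOWS = 8  # assumed number of windows; real implementations vary

def physical_register(reg, cwp, nwindows=NWINDOWS):
    """Map visible register 0..31 to a physical slot for window 'cwp'."""
    if not 0 <= reg < 32:
        raise ValueError("register number must be 0..31")
    if reg < 8:
        # r0-r7 are global and not part of any window
        return ("global", reg)
    # r8-r15 outs, r16-r23 locals, r24-r31 ins: 24 slots per window,
    # but successive windows are only 16 slots apart, so 8 overlap
    return ("windowed", (cwp * 16 + (reg - 8)) % (nwindows * 16))

def call(cwp, nwindows=NWINDOWS):
    """Move the window 16 registers down the stack (function call)."""
    return (cwp - 1) % nwindows

def ret(cwp, nwindows=NWINDOWS):
    """Move the window back up (function return)."""
    return (cwp + 1) % nwindows

caller = 3
callee = call(caller)
# The caller's first 'out' register is the callee's first 'in' register
print(physical_register(8, caller), physical_register(24, callee))
```

In this model a call changes only the window pointer, which is why no registers need to be copied to pass arguments until the register stack itself runs out.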
SPARC is 'scalable' mainly because the register stack can be expanded (up to 512, or 32 windows), to reduce loads and saves between functions, or scaled down to reduce interrupt or context switch time, when the entire register set has to be saved. Function calls are usually much more frequent than interrupts, so the large register set is usually a plus, but compilers now can usually produce code which uses a fixed register set as efficiently as a windowed register set across function calls.
SPARC is not a chip, but a specification, and so there are various designs of it. It has undergone revisions, and now has multiply and divide instructions. Original versions were 32 bits, but 64 bit and superscalar versions were designed and implemented (beginning with the Texas Instruments SuperSparc in late 1992), but performance lagged behind other load-store and even Intel 80x86 processors until the UltraSPARC (late 1995) from Texas Instruments and Sun, and superscalar HAL/Fujitsu SPARC64 multichip CPU. Most emphasis by licensees other than Sun and HAL/Fujitsu has been on low cost, embedded versions.
The UltraSPARC is a 64-bit superscalar processor series which can issue up to four instructions at once (but not out of order) to any of nine units: two integer units, two of the five floating point/graphics units (add, add and multiply, divide and square root), the branch and two load/store units. The UltraSPARC also added a block move instruction which bypasses the caches (2-way 16K instr, 16K direct mapped data), to avoid disrupting them, and specialized pixel operations (VIS - the Visual Instruction Set) which can operate in parallel on 8, 16, or 32-bit integer values packed in a 64-bit floating point register (for example, four 8 X 16 -> 16 bit multiplications in a 64 bit word, a sort of simple SIMD/vector operation). More extensive than the Intel MMX instructions, or earlier HP PA-RISC MAX and Motorola 88110 graphics extensions, VIS also includes some 3D to 2D conversion, edge processing and pixel distance (for MPEG, pattern-matching support).
The UltraSPARC I/II were architecturally the same. The UltraSPARC III (mid-2000) did not add out-of-order execution, on the grounds that memory latency eliminates any out-of-order benefit, and did not increase instruction parallelism after measuring the instructions in various applications (although it could dispatch six, rather than four, to the functional units, in a fourteen-stage pipeline). It concentrated on improved data and instruction bandwidth.
The HAL/Fujitsu SPARC64 series (used in Fujitsu servers using Sun Solaris software) concentrates more on execution performance than bandwidth as the Sun versions do. The initial version can issue up to four in order instructions simultaneously to four buffers, which issue to four integer, two floating point, two load/store, and the branch unit, and may complete out of order unlike UltraSPARC (an instruction completes when it finishes without error, is committed when all instructions ahead of it have completed, and is retired when its resources are freed - these are 'invisible' stages in the SPARC64 pipeline). A combination of register renaming, a branch history table, and processor state storage (like in the Motorola 88K) allow for speculative execution while maintaining precise exceptions/interrupts (renamed integer, floating, and CC registers - trap levels are also renamed and can be entered speculatively). VIS extensions are not implemented, but are emulated by trapping to a handler routine.
The SPARC64 V (late 2002) is aggressively out-of-order, concentrating on branch prediction more than load latency, although it does include data speculation (loaded data are used before they are known to be valid - if they turn out to be invalid, the load operation is repeated, but this is still a win if data is usually valid (in the L1 cache)). It can dispatch six to eight instructions to: four integer units, two FPU (one with VIS support), two load units and two store units (store takes one cycle, load takes at least two, so the units are separate, unlike other designs). It has a nine stage pipeline for single-cycle instructions (up to twelve for more complex operations), with integer and floating point registers part of the integer/floating point reorder buffers allowing operands to be fetched before dispatching instructions to execution pipe segments.
Instructions are predecoded in cache, incorporating some ideas from dataflow designs - source operands are replaced with references to the instructions which produce the data, rather than matching up an instruction's source registers with destination registers of earlier instructions during result forwarding in the execution stage. The cache also performs basic block trace scheduling to form issue packets, something normally reserved for compilers.
Fujitsu uses these CPUs in its PRIMEPOWER servers which compete with mainframes, so they are designed with mainframe reliability features. Parity or error checking and correction bits are used in internal busses, and the CPU will actually restart the instruction stream after certain errors (logged to registers which can be checked to indicate a failing CPU which should be replaced).
While UltraSPARC III has mediocre performance on benchmarks (emphasizing data throughput, with mixed success), SPARC64 V is among the top of the 64-bit CPUs in Spec benchmarks.
Berkeley RISC design (and the IBM 801 project), as a modern successor to the earlier 2900 bitslice series (beginning around 1981). Like the SPARC design that was introduced shortly later, the 29000 has a large set of registers split into local and global sets. But though it was introduced before the SPARC, it has a more elegant method of register management.
The 29000 has 64 global registers, in comparison to the SPARC's eight. In addition, the 29000 allows variable sized windows allocated from the 128 register stack cache. The current window or stack frame is indicated by a stack pointer (a modern version of the ISAR register in the Fairchild F8 CPU), and a pointer to the caller's frame is stored in the current frame, like in an ordinary stack (directly supporting stack languages like C, a CISC-like philosophy). Spills and fills occur only at the ends of the cache, and registers are saved/loaded from the memory stack (normally implemented as a register cache separate from the execution stack, similar to the way FORTH uses stacks). This allows variable window sizes, from 1 to 128 registers. This flexibility, plus the large set of global registers, makes register allocation easier than in SPARC (optimised stack operations also make it ideal for stack-oriented interpreted languages such as PostScript, making it popular as a laser printer controller).
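A rough model of the variable sized windows, assuming local registers are addressed relative to the stack pointer and wrap around the 128 register stack cache. The function names are invented for illustration, the downward growth direction is an assumption, and spills and fills to the memory stack are left out.

```python
STACK_CACHE_SIZE = 128  # local registers in the stack cache (from the text)

def physical_local(stack_pointer, n):
    """Map local register n of the current frame to a stack cache slot."""
    if not 0 <= n < STACK_CACHE_SIZE:
        raise ValueError("local register number must be 0..127")
    return (stack_pointer + n) % STACK_CACHE_SIZE

def call(stack_pointer, frame_size):
    """Allocate a callee frame of 'frame_size' registers (1..128)."""
    if not 1 <= frame_size <= STACK_CACHE_SIZE:
        raise ValueError("frame size must be 1..128")
    return (stack_pointer - frame_size) % STACK_CACHE_SIZE

def ret(stack_pointer, frame_size):
    """Release the callee frame and return to the caller's window."""
    return (stack_pointer + frame_size) % STACK_CACHE_SIZE

caller_sp = 100
callee_sp = call(caller_sp, 6)  # the callee asks for only six registers
# The caller's local register 2 is still reachable as the callee's register 8
print(physical_local(caller_sp, 2), physical_local(callee_sp, 8))
```

Because each call moves the pointer by exactly the frame size it needs, a small function consumes only a few slots, unlike the fixed 16 register step of a SPARC window.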
There is no special condition code register - any general register is used instead, allowing several condition codes to be retained, though this sometimes makes code more complex. An instruction prefetch buffer (using burst mode) ensures a steady instruction stream. Branches to another stream can cause a delay, so the first four new instructions are cached - next time a cached branch (up to sixteen) is taken, the cache supplies instructions during the initial memory access delay.
Registers aren't saved during interrupts, allowing the interrupt routine to determine whether the overhead is worthwhile. In addition, a form of register access control is provided. All registers can be protected, in blocks of 4, from access. These features make the 29000 useful for embedded applications, which is where most of these processors are used, allowing it at one point to claim the title of 'the most popular RISC processor'. The 29000 also includes an MMU and support for the 29027 FPU. The 29030 added
The 80C166 has sixteen 16 bit registers, with the lower eight usable as sixteen 8 bit registers, which are stored in overlapping windows (like in the SPARC) in the on-chip RAM (or register bank), pointed to by the Context Pointer (CP) (similar to the SP in the AMD 29K). Unlike the SPARC, register windows can overlap by a variable amount (controlled by the CP), and there are no spills or fills because the registers are considered part of the RAM address space (like in the TMS 9900), and could even extend to off chip RAM. This eliminates wasted registers of SPARC style windows.
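Since the registers are just RAM words located through the CP, the variable overlap can be shown with simple address arithmetic. The sketch below assumes word register n lives at CP + 2n; the function names and the example CP value 0xFC00 are made up for illustration.

```python
NUM_REGISTERS = 16  # sixteen 16 bit registers per context
WORD_BYTES = 2

def word_register_address(cp, n):
    """RAM address of 16 bit register n for a given Context Pointer."""
    if not 0 <= n < NUM_REGISTERS:
        raise ValueError("register number must be 0..15")
    return cp + WORD_BYTES * n

def shared_registers(cp_a, cp_b):
    """List (register in context A, register in context B) pairs that
    refer to the same RAM word."""
    addresses_b = {word_register_address(cp_b, n): n
                   for n in range(NUM_REGISTERS)}
    shared = []
    for n in range(NUM_REGISTERS):
        address = word_register_address(cp_a, n)
        if address in addresses_b:
            shared.append((n, addresses_b[address]))
    return shared

caller_cp = 0xFC00                        # hypothetical address in on-chip RAM
callee_cp = caller_cp + WORD_BYTES * 12   # overlap the last four registers
print(shared_registers(caller_cp, callee_cp))
```

Choosing a different offset for the new CP changes how many registers are shared, so a call that passes two values need not tie up eight.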
Address space (18 to 24 bits) is segmented (64K code segments with a separate code segment register, 16K data segments with upper two bits of 16 bit address selecting one of four data segment registers).
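The data addressing scheme can be sketched as follows, assuming each of the four data segment registers holds a 16K page number that replaces the upper two bits of the 16 bit address. The function name and the register contents in the example are illustrative only.

```python
SEGMENT_BITS = 14                    # 16K data segments
OFFSET_MASK = (1 << SEGMENT_BITS) - 1

def physical_data_address(addr16, segment_regs):
    """Translate a 16 bit data address using four data segment registers."""
    if not 0 <= addr16 <= 0xFFFF:
        raise ValueError("address must fit in 16 bits")
    if len(segment_regs) != 4:
        raise ValueError("exactly four data segment registers are expected")
    selector = addr16 >> SEGMENT_BITS   # upper two bits pick the register
    offset = addr16 & OFFSET_MASK       # lower 14 bits pass through unchanged
    return (segment_regs[selector] << SEGMENT_BITS) | offset

# Hypothetical setup: the second register points at page 5 instead of page 1
segments = [0, 5, 2, 3]
print(hex(physical_data_address(0x4010, segments)))
```

With the registers set to 0, 1, 2 and 3 the translation is the identity, so a small program can ignore segmentation entirely.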
The 80C166 has 32 bit instructions, while it's a 16 bit processor (compared to the Hitachi SH, which is a 32 bit CPU with 16 bit instructions). It uses a four stage pipeline, with a limited (one instruction) branch cache.
Stanford MIPS project, which stood for Microprocessor without Interlocked Pipeline Stages [See Appendix A], and was arguably the first commercial RISC processor (other candidates are the ARM and IBM ROMP used in the IBM PC/RT workstation, which was designed around 1981 but delayed until 1986). It was intended to simplify processor design by eliminating hardware interlocks between the five pipeline stages. This means that only single execution cycle instructions can access the thirty two 32 bit general registers, so that the compiler can schedule them to avoid conflicts. This also means that LOAD/STORE and branch instructions have a 1 cycle delay to account for. However, because of the importance of multiply and divide instructions, a special HI/LO pair of multiply/divide registers exist which do have hardware interlocks, since these take several cycles to execute and produce scheduling difficulties.
Like the AMD 29000 and DEC Alpha, the R2000 has no condition code register, considering it a potential bottleneck. The PC is user readable. The CPU includes an MMU unit that can also control a cache, and the CPU was one of the first which could operate as a big or little endian processor. An FPU, the R2010, is also specified for the processor.
Newer versions included the R3000 (1988), with improved cache control, and the R4000 (1991), which expanded to 64 bits and is superpipelined (twice as many pipeline stages do less work at each stage, allowing a higher clock rate and twice as many instructions in the pipeline at once, at the expense of increased latency when the pipeline can't be filled, such as during a branch, and requiring interlocks added between stages for compatibility, making the original "I" in the "MIPS" acronym meaningless). The R4400 and above integrated the FPU with on-chip caches. The R4600 and later versions abandoned superpipelines.
The superscalar R8000 (1994) was optimised for floating point operation, issuing two integer or load/store operations (from four integer and two load/store units) and two floating point operations simultaneously (FP instructions sent to the independent R8010 floating point coprocessor (with its own set of thirty-two 64-bit registers and load/store queues)).
The R10000 and R12000 versions (early 1996 and May 1997) added multiple FPU units, as well as almost every advanced modern CPU feature, including separate 2-way I/D caches (32K each) plus on-chip secondary controller (and high speed 8-way split transaction bus (up to 8 transactions can be issued before the first completes)), superscalar execution (load four, dispatch five instructions (may be out of order) to any of two integer, two floating point, and one load/store units), dynamic register renaming (integer and floating point rename registers (thirty two in the R10K, forty eight in the R12K)), and an instruction cache where instructions are partially decoded when loaded into the cache, simplifying the processor decode (and register rename/issue) stage. This technique was first implemented in the AT&T CRISP/Hobbit CPU, described later. Branch prediction and target caches are also included.
The six stage 2-way (int/float) superscalar R5000 (January, 1996) was added to fill the gap between R4600 and R10000, without any fancy features (out of order or branch prediction buffers). For embedded applications, MIPS and LSI Logic added a compact 16 bit instruction set which can be mixed with the 32 bit set (same as the ARM Thumb 16 bit extension), implemented in a CPU called TinyRISC (October 1996), as well as MIPS V and MDMX (MIPS Digital Multimedia Extensions, announced October 1996). MIPS V added parallel floating point (two 32 bit fields in 64 bit registers) operations (compared to similar HP MAX integer or Sun VIS and Intel MMX floating point unit extensions), MDMX added integer 8 or 16 bit subwords in 64 bit FPU registers and 24 and 48 bit subwords in a 192 bit accumulator for multimedia instructions (a MAC instruction on an 8-bit value can produce a 24-bit result, hence the large accumulator). Vector-scalar operations (ex: multiply all subwords in a register by subword 3 from another register) are also supported. These extensive instructions are partly derived from Cray vector instructions (Cray is owned by SGI, the parent company of MIPS), and are much more extensive than the earlier multimedia extensions of other CPUs. Future versions are expected to add Java virtual machine support.
MDMX instructions were never implemented in a CPU, because the MDMX and MIPS V extensions were superseded by the MIPS64 instruction set, and MIPS-3D extensions for 3D operations.
Rumour has it that delays and performance limits, but more probably SGI's financial problems, meant that the R10000 and derivatives (R12K and R14K) were the end of the high performance line for the MIPS architecture. SGI scaled back high end development in favour of the promised IA-64 architecture announced by HP and Intel. MIPS was sold off by SGI, and the MIPS processor was retargeted to embedded designs where it's more successful. The R20K (early 2001) implemented the MIPS-3D extensions, and increased the number of integer units to six with a seven stage pipeline.
SiByte introduced a less parallel, high clock rate 64-bit MIPS CPU (SB-1, mid 2000) exceeding what marketing people enthusiastically call the "1GHz barrier" (not an actual barrier of any sort).
As part of an attempt to create a domestic computer industry in China, BLX IC Design Corp of China implemented a version of the MIPS architecture (without unaligned 32-bit load/store support, to avoid patent issues) called the Godson (known as Dragon in English) series (March 2003).
Nintendo used a version of the MIPS CPU in the N64 (along with SGI-designed 3-D hardware), accounting for around 3/4 of MIPS embedded business in 1999 until switching to a custom IBM PowerPC, and a graphics processor from ArtX (founded by ex-SGI engineers) for its successor named GameCube (codenamed "Dolphin"). Sony also uses it in its Playstation series.
microcode ROM with a simple 32 bit data path bolted to its side". Performance wasn't spectacular, but it was used in a pre-Unix workstation from HP. It led to the Vision, a fairly complex capability-based architecture. At the same time as Vision, the Spectrum project was started at HP labs based on the IBM 801, and further developed with implementation groups.
A new processor was needed to replace older 16-bit stack-based processors in HP-3000 MPE minicomputers. Initially a more complex replacement called Omega was started, but cancelled, and both Vision and Spectrum were proposed for Omega's replacement (code-named Alpha, not to be confused with the DEC Alpha). Spectrum was eventually selected, and became Precision Architecture, or PA-RISC. It also replaced Motorola 680x0 processors in the HP-9000 HP/UX Unix minicomputers and workstations.
A design typical of many load-store processors, it has an unusually large instruction set for a RISC processor (including a conditional (predicated) skip instruction similar to those in the ARM processor), partly because initial design took place before RISC philosophy was popular, and partly because careful analysis showed that performance benefited from the instructions chosen - in fact, version 1.1 added new multiple operation instructions combined from frequent instruction sequences, and HP was among the first to add multimedia instructions (the MAX-1 and MAX-2 instructions, similar to Sun VIS or Intel MMX). Despite this, it's a simple design - the entire original CPU had only 115,000 transistors, less than twice the much older 68000.
It's almost the canonical load-store design, similar except in details to most other mainstream load-store processors like the Fairchild/Intergraph Clipper (1986), and the Motorola 88K in particular. It has a 5 stage pipeline, which (unlike early MIPS (R2000) processors) had hardware interlocks from the beginning for instructions which take more than one cycle, as well as result forwarding (a result can be used by a following instruction without waiting for it to be stored in a register first).
Originally with a single instruction/data bus, it was later expanded to a Harvard architecture (separate instruction and data buses). It has thirty-two 32-bit integer registers (GR0 wired to constant 0, GR31 used as a link register for procedure calls), with seven 'shadow registers' which preserve the contents of a subset of the GR set during fast interrupts (also like ARM). Version 1.0 had sixteen 64-bit floating point registers, version 1.1 added features from the Apollo PRISM FPU after Hewlett-Packard acquired the company in 1988, resulting in thirty-two 64-bit floating point registers (also as sixty-four 32-bit and sixteen 128-bit), in an FPU (which could execute a floating point instruction simultaneously). Later versions (the PA-RISC 7200 in 1994) added a second integer unit (still dispatching only two instructions at a time to any of the three units). Addressing originally was 48 bits, and expanded to 64 bits, using a segmented addressing scheme.
The PA-RISC 7200 also included a tightly integrated cache and MMU, a high speed 64-bit 'Runway' bus, and a fast but complex fully associative 2KB on-chip assist cache, between the simpler direct-mapped data cache and main memory, which reduces thrashing (repeatedly loading the same cache line) when two memory addresses are aliased (mapped to the same cache line). Instructions are predecoded into a separate instruction cache (like the AT&T CRISP/Hobbit).
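The thrashing problem, and the way a small fully associative cache relieves it, can be seen in a toy simulation. This is a simplified model rather than the actual PA-7200 policy: the 32 byte line size, the FIFO replacement, and the rule that missed lines enter the assist cache first and move to the direct-mapped cache when pushed out are all assumptions made for the sketch.

```python
from collections import OrderedDict

LINE_BYTES = 32  # assumed cache line size

def count_misses(addresses, main_lines, assist_entries):
    """Count misses for a direct-mapped cache of 'main_lines' lines, backed
    by a fully associative FIFO assist cache of 'assist_entries' lines."""
    main_tags = [None] * main_lines   # one line number per direct-mapped slot
    assist = OrderedDict()            # line number -> True, oldest first
    misses = 0
    for address in addresses:
        line = address // LINE_BYTES
        if main_tags[line % main_lines] == line or line in assist:
            continue                  # hit in either cache
        misses += 1
        if assist_entries == 0:
            main_tags[line % main_lines] = line   # plain direct-mapped fill
            continue
        assist[line] = True           # new lines go to the assist cache
        if len(assist) > assist_entries:
            oldest, _ = assist.popitem(last=False)
            main_tags[oldest % main_lines] = oldest  # pushed-out line moves on
    return misses

a = 0
b = 256 * LINE_BYTES                  # maps to the same direct-mapped slot as a
trace = [a, b] * 100                  # alternate between the two aliases
print(count_misses(trace, 256, 0), count_misses(trace, 256, 64))
```

Without the assist cache every access in this trace evicts the line the next access needs; with it, both aliased lines stay resident after their first miss.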
The PA-RISC 8000 (April 1996, intended to compete with the R10000, UltraSparc, and others) expands the registers and architecture to 64 bits (eliminating the need for segments), and adds aggressive superscalar design - up to 5 instructions out of order, using fifty six rename registers, to ten units (five pairs of: ALU, shift/merge, FPU mult/add, divide/sqrt, load/store). The CPU is split in two, with load/store (high latency) instructions dispatched from a separate queue from operations (except for branch or read/modify/write instructions, which are copied to both queues). It also has a deep pipeline and speculative execution of branches (many of the same features as the R10000, in a very elegant implementation).
The PA-RISC 8500 (mid 1998) broke with HP tradition (in a big way) and added on-chip cache - 1.5MB L1 cache.
HP pioneered the addition of multimedia instructions with the MAX-1 (Multimedia Acceleration eXtension) extensions in the PA-7100LC (pre-1994) and 64-bit (version 2.0) MAX-2 extensions in the PA-8000, which allowed vector operations on two or four 16-bit subwords in 32-bit or 64-bit integer registers (this only required circuitry to slice the integer ALU (similar to bit-slice processors, such as the AMD 2901), adding only 0.1 percent to the PA-8000 CPU area - using the FPU registers like Sun's VIS and Intel's MMX do would have required duplicating ALU functions. 8 and 32-bit support, multiplication, and complex instructions were also left out in favour of powerful 'mix' and 'permute' packing/unpacking operations).
A replacement VLIW version known as PA-RISC Wide-Word was used as a basis for the IA-64 CPU with Intel. Development on PA-RISC continued with the 8700, which used the same CPU bus as the HP-designed McKinley version of IA-64, allowing the processors to be interchangeable during the introductory period of IA-64 (and possibly as a hedge against its failure by skeptical HP designers). A two-8700 chip was introduced in 2002, featuring an unusual off-chip DRAM shared level 2 cache (rather than faster SRAM) which allows a larger cache for lower cost, lower power, and smaller space. Although typically sporting fewer of the advanced (and promised) features of competing CPU designs, a simple elegant design and effective instruction set has kept PA-RISC performance among the best in its class (of those actually available at the same time) since its introduction.
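Slicing the ALU amounts to stopping carries at subword boundaries. The sketch below imitates that in software for four 16-bit subwords in a 64-bit word, using ordinary wrap-around addition; it is not a model of any specific MAX instruction, and the function names are invented for illustration.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF
TOP_BITS = 0x8000800080008000   # top bit of each 16-bit subword

def parallel_add16(a, b):
    """Add four 16-bit subwords at once, with no carry between subwords."""
    # Add the low 15 bits of every subword; carries cannot cross a boundary
    # because the top bit of each subword has been cleared first.
    low = (a & ~TOP_BITS & MASK64) + (b & ~TOP_BITS & MASK64)
    # Fold the top bits back in with XOR (an add that discards the carry).
    return (low ^ ((a ^ b) & TOP_BITS)) & MASK64

def reference_add16(a, b):
    """Same result computed one subword at a time, for comparison."""
    result = 0
    for lane in range(4):
        shift = 16 * lane
        total = ((a >> shift) & 0xFFFF) + ((b >> shift) & 0xFFFF)
        result |= (total & 0xFFFF) << shift
    return result

x = 0x0001FFFF80001234
y = 0x0001000180000001
print(hex(parallel_add16(x, y)))
```

The second subword overflows from 0xFFFF to 0 without disturbing its neighbour, which is the behaviour the sliced ALU provides in a single cycle.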
Harvard architecture (the same as the Fairchild/Intergraph Clipper C100 (1986), which beat it by 2 years). Each bus has a separate cache, so simultaneous data and instruction access doesn't conflict. Except for this, it is similar to the Hewlett Packard Precision Architecture (HP/PA) in design (including many control/status registers only visible in supervisor mode), though the 88000 is more modular, has a small and elegant instruction set, no special status register (compare stores 16 condition code bits (equal, not equal, less-or-equal, any byte equal, etc.) in any general register, and branch checks whether one bit is set or clear), and lacks segmented addressing (limiting addressing to 32 bits, vs. 64 bits). The 88200 MMU unit also provides dual caches (including multiprocessor support) and MMU functions for the 88100 CPU (like the Clipper). The 88110 includes caches and MMU on-chip.
The 88000 has thirty-two 32 bit user registers, with up to 8 distinct internal function units - an ALU and a floating point unit (sharing the single register set) in the 88100 version, multiple ALU and FPU units (with thirty-two 80-bit FPU registers) and two-issue instructions were added to the 88110 to produce one of the first superscalar designs (following the National Semiconductor Swordfish). Other units could be designed and added to produce custom designs for customers, and the 88110 added a graphics/bit unit which packs or unpacks 4, 8 or 16-bit integers (pixels) within 32-bit words, and multiplies packed bytes by an 8-bit value. But it was introduced late and never became as popular in major systems as the MIPS or HP processors. Development (and performance) has lagged as Motorola favoured the PowerPC CPU, coproduced with IBM.
Like most modern processors, the 88000 is pipelined (with interlocks), and has result forwarding (in the 88110 one ALU can feed a result directly into another for the next cycle). Loads and saves in the 88110 are buffered so the processor doesn't have to wait, except when loading from a memory location still waiting for a save to complete. The 88110 also has a history buffer for speculatively executing branches and to make interrupts 'precise' (they're imprecise in the 88100). The history buffer is used to 'undo' the results of speculative execution or to restore the processor to 'state' when the interrupt occurred - a 1 cycle penalty, as opposed to 'register renaming' which buffers results in another register and either discards or saves it as needed, without penalty.
LSI-11 based and 80186-based graphics terminals, then NS32032-based workstations before moving to an early RISC CPU. It continued development (C300 in 1988) and produced very advanced systems, but decided it couldn't compete alone in processor technology. After a brief joint development with Sun on the next generation SPARC, the company switched to Intel 80x86-based processors, and when a patent dispute between them erupted (Fairchild itself was bought by National Semiconductor which had a patent agreement with Intel, and Intel claimed rights to Clipper-related patents developed after the Clipper was sold to Intergraph), Intel restricted technical information to Intergraph, and Intergraph gave up on hardware, returning to software.
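The idea behind a history buffer can be sketched in a few lines: every speculative register write first logs the old value, and a wrong guess is repaired by replaying the log backwards. The class and method names are invented for illustration, and the model ignores timing, memory writes and everything else a real 88110 has to track.

```python
class HistoryBufferRegisters:
    """Register file that can undo writes made since the last commit."""

    def __init__(self, count=32):
        self.regs = [0] * count
        self.history = []            # (register, previous value) pairs

    def write(self, reg, value):
        # Log the old value before overwriting it
        self.history.append((reg, self.regs[reg]))
        self.regs[reg] = value

    def commit(self):
        # Speculation was correct: the logged old values are no longer needed
        self.history.clear()

    def undo(self):
        # Speculation was wrong (or an interrupt needs the precise state):
        # restore old values, newest first
        while self.history:
            reg, old = self.history.pop()
            self.regs[reg] = old

cpu = HistoryBufferRegisters()
cpu.write(1, 10)
cpu.commit()          # r1 = 10 is now architectural state
cpu.write(1, 20)      # speculative writes past a predicted branch
cpu.write(2, 30)
cpu.undo()            # branch was mispredicted
print(cpu.regs[1], cpu.regs[2])
```

Results go straight into the real registers here, so recovery costs work at undo time; register renaming pays instead by keeping results in spare registers until they are known to be wanted.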
The C100 was a three-chip set like the Motorola 88000 (but predating it by two years), with a Harvard architecture CPU and separate MMU/cache chips for instruction and data. It differed from the 88K and HP PA-RISC in having sixteen 32-bit user registers and sixteen 64-bit FPU registers, rather than the more common thirty-two, and 16 and 32 bit instruction lengths. ROM macros implemented complex instructions. The C300 had improved floating point units, and increased clock speeds by increasing number of pipeline stages. The C400 (1990) was a two-issue superscalar version with separate address adder (eliminating ALU contention between address generation and execution in C100/C300). The following generation (C5, expected 1994) was dropped in 1993 to switch to SPARC.
The only other distinguishing features of the Clipper are a bank of sixteen supervisor registers which completely replace the user registers, (the ARM replaces half the user registers on an FIRQ interrupt) and the addition of some microcode instructions like in the Intel i960.
Berkeley experimental load-store design. It is simple, with a short 3-stage pipeline, and it can operate in big- or little-endian mode. A seven-member team created the first version in a year and a half, including four support chips.
The original ARM (ARM1, 2 and 3) was a 32 bit CPU, but used 26 bit addressing. The newer ARM6xx spec is completely 32 bits. It has user, supervisor, and various interrupt modes (including 26 bit modes for ARM2 compatibility). The ARM architecture has sixteen registers (including user visible PC as R15) with a multiple load/save instruction, though many registers are shadowed in interrupt modes (2 in supervisor and IRQ, 7 in FIRQ) so need not be saved, for fast response. The instruction set is reminiscent of the 6502, used in Acorn's earlier computers.
A feature introduced in microprocessors by the ARM is that every instruction is predicated, using a 4 bit condition code (including 'never execute', not officially recommended), an idea later used in some HP PA-RISC instructions and the TI 320C6x DSP. Another bit indicates whether the instruction should set condition codes, so intervening instructions don't change them. This easily eliminates many branches and can speed execution. Another unique and useful feature is a barrel shifter which operates on the second operand of most ALU operations, allowing shifts to be combined with most operations (and index registers for addressing), effectively combining two or more instructions into one (similar to the earlier design of the funky Signetics 8x300).
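Both features can be imitated in a small sketch. The greatest common divisor loop below follows the well known four instruction ARM sequence (CMP, SUBGT, SUBLT, BNE), where the two subtracts are predicated and leave the flags alone; the Python helper names are invented, only a few of the sixteen conditions are modelled, and the inputs are assumed to be positive.

```python
MASK = 0xFFFFFFFF

def cmp_flags(a, b):
    """Flags produced by CMP a, b (a 32 bit subtract that only sets flags)."""
    result = (a - b) & MASK
    sign_a, sign_b, sign_r = a >> 31, b >> 31, result >> 31
    return {
        "N": bool(sign_r),                               # negative
        "Z": result == 0,                                # zero
        "C": a >= b,                                     # no borrow
        "V": sign_a != sign_b and sign_r != sign_a,      # signed overflow
    }

CONDITIONS = {
    "NE": lambda f: not f["Z"],
    "GT": lambda f: not f["Z"] and f["N"] == f["V"],
    "LT": lambda f: f["N"] != f["V"],
}

def gcd(a, b):
    """Euclid by subtraction with predicated instructions (a, b > 0)."""
    r0, r1 = a & MASK, b & MASK
    while True:
        flags = cmp_flags(r0, r1)          # CMP   r0, r1
        if CONDITIONS["GT"](flags):        # SUBGT r0, r0, r1 (flags untouched)
            r0 = (r0 - r1) & MASK
        if CONDITIONS["LT"](flags):        # SUBLT r1, r1, r0 (flags untouched)
            r1 = (r1 - r0) & MASK
        if not CONDITIONS["NE"](flags):    # BNE   loop
            return r0

def add_shifted(rn, rm, shift):
    """ADD rd, rn, rm, LSL #shift: the barrel shifter works on operand two."""
    return (rn + ((rm << shift) & MASK)) & MASK

print(gcd(48, 18), add_shifted(7, 7, 2))   # x + (x << 2) multiplies by five
```

The loop body contains a single branch; a conventional instruction set would need a compare and branch around each subtract.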
These features make ARM code both dense (unlike most load-store processors) and efficient, despite the relatively low clock rate and short pipeline - it is roughly equivalent to a much more complex 80486 in speed.
The ARM6 series consisted of the ARM6 CPU core (35,000 transistors, which can be used as the basis for a custom CPU) the ARM60 base CPU, and the ARM600 which also includes 4K 64-way set-associative cache, MMU, write buffer, and coprocessor interface (for FPU, with eight 80-bit registers). The ARM7 series (Dec 1994), increased performance by optimising the multiplier, and adding DSP-like extensions including 32 bit and 64 bit multiply and multiply/accumulate instructions (operand data paths lead from registers through the multiplier, then the shifter (one operand), and then to the integer ALU for up to three independent operations). It also doubles cache size to 8K, includes embedded In Circuit Emulator (ICE) support, and raises the clock rate significantly.
A full DSP coprocessor (codenamed Piccolo, expected second half 1997) was to add an independent set of sixteen 32-bit registers (also accessable as thirty two 16 bit registers), four which can be used as 48 bit registers, and a complete DSP instruction set (including four level zero-overhead loop operations), using a load-store model similar to the ARM itself. The coprocessor had its own program counter, interacting with the CPU which performed data load/store through input/output buffers connected to the coprocessor bus (similar but more intelligent than the address unit in a typical DSP (such as the Motorola 56K) supporting the data unit). The coprocessor shared the main ARM bus, but used a separate instruction buffer to reduce conflict. Two 16 bit values packed in 32 bit registers could be computed in parallel, similar to the HP PA-RISC MAX-1 multimedia instructions. Unfortunately, this interesting concept didn't produce enough commercial interest to complete development and was difficult to produce a compiler for (essentially, it was two CPU executing two programs) - instead, DSP support instructions (more flexible MAC, saturation arithmetic, simple SIMD) were later added to the ARM9E CPU.
ARM10 (1998) added a vector floating point unit (VFP) coprocessor, with thirty two 32-bit floating point registers (usable as sixteen 64-bit registers) which can be loaded, stored, and operated on as two sixteen element vectors (vector-vector and vector-scalar operations) simultaneously. Vectors are computed one operation per cycle (compared to the Hitachi SH-4, which computes four per cycle).
DEC licensed the architecture, and developed the SA-110 (StrongARM) (February 1996), running a 5-stage pipeline at 100 to 233MHz (using only 1 watt of power), with 5-port register file, faster multiplier, single cycle shift-add, eight entry write buffer, and Harvard architecture (16K each 32-way I/D caches).
As part of a patent settlement with DEC, Intel took over the StrongARM, replacing the Intel i960 for embedded systems. The next version named XScale (2000) added low power enhancements and power management allowing the clock speed to be varied, added another stage to the memory pipeline (for eight stages, vs. seven for normal ALU instructions), and a 128 entry branch target buffer. A multiply-accumulate (MAC) unit added two 32-bit source registers, and one 40-bit accumulator. Single 16x16 or 16x32-bit results (16 bits from the hi/lo half of either register) or two 16x16-bit results (hi/hi and lo/lo source from each source register) can be added to the accumulator.
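The dual 16x16 form of the multiply-accumulate (hi/hi and lo/lo products summed into a 40-bit accumulator) can be sketched roughly as follows. The helper names are hypothetical, and two details are assumptions of this sketch rather than statements about the XScale: the operands are treated as signed, and the accumulator wraps at 40 bits.

```python
def s16(x: int) -> int:
    """Interpret the low 16 bits of x as a signed two's complement value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def wrap40(x: int) -> int:
    """Wrap x to a signed 40 bit two's complement value (assumed behaviour)."""
    x &= (1 << 40) - 1
    return x - (1 << 40) if x & (1 << 39) else x

def dual_mac16(acc: int, a: int, b: int) -> int:
    """Add hi(a)*hi(b) and lo(a)*lo(b) to a 40 bit accumulator."""
    prod_hi = s16(a >> 16) * s16(b >> 16)
    prod_lo = s16(a) * s16(b)
    return wrap40(acc + prod_hi + prod_lo)
```

The eight bits above the 32-bit product width act as headroom, so a long run of products can be summed before the accumulator overflows.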
To fill the gap between ARM7 and DEC/Intel StrongARM, ARM also developed the ARM8/800 which includes many StrongARM features, and the ARM9 with Harvard busses, write buffers, and flexible memory protection mapping.
Other companies such as Motorola, IBM and Texas Instruments have also licensed the basic ARM design, making it one of the most widely licensed embedded designs.
Like the Motorola Coldfire, ARM developed a low cost 16-bit version called Thumb, which recodes a subset of ARM CPU instructions into 16 bits (decoded to native 32-bit ARM instructions without penalty - similar to the CISC decoders in the newest 80x86 compatible and 68060 processors, except they decode native instructions into a newer one, while Thumb does the reverse). Thumb programs can be 30-40% smaller than already dense ARM programs. Native ARM code can be mixed with Thumb code when the full instruction set is needed.
Jazelle (announced October 2000) is a decoder similar to Thumb, but decodes simple Java Virtual Machine bytecode instructions to ARM (complex bytecodes are trapped and emulated by native ARM code, as is the JVM itself - different JVM software can be used).
The ARM CPU was chosen for the Apple Newton handheld system because of its speed, combined with the low power consumption, low cost and customizable design (the ARM610 version used by Apple includes a custom MMU supporting object oriented protection and access to memory for the Newton's NewtOS). The Newton was somewhat overambitious, and was discontinued, but a large number of similar devices, as well as mobile phones, have been based on ARM CPUs for the same reasons.
An experimental asynchronous version of the ARM6 (operates without an external or internal clock signal) called AMULET has been produced by Steve Furber's research group at Manchester University. The first version (AMULET1, early 1993) is about 70% the speed of a 20MHz ARM6 on average (using the same fabrication process), but simple operations (multiplication is a big win at up to 3 times the speed) are faster (since they don't need to wait for a clock signal to complete). AMULET2e (October 1996, 93K transistor AMULET2 core plus four 1K fully associative cache blocks) is 30% faster (40 MIPS, 1/2 the performance of a 75MHz ARM810 using the same fabrication), uses less power, and includes features such as branch prediction. AMULET3i (September 2000) has been delayed, but simulations show it to be roughly equivalent to ARM9.
DSP, based on the earlier 320C20/10 16 bit fixed point DSPs (1982). It has eight 40 bit extended precision registers R0 to R7 (32 bits plus 8 guard bits for floating, 32 bits for fixed), eight 32 bit auxiliary registers AR0 to AR7 (used for pointers) with two separate arithmetic units for address calculation, and twelve 32 bit control registers (including status, an index register, stack, interrupt mask, and repeat block loop registers).
It includes on chip memory in the form of one 4K ROM block, and two 1K RAM blocks - each block has its own bus, for a total of three (compared to one instruction and one data bus in a Harvard architecture), which essentially function as programmer controlled caches. Two arguments to the ALU can be from memory or registers, and the result is written to a register, through a 4 stage pipeline.
The ALU, address controller and control logic are separate - a division that is much clearer in the AT&T DSP32, ADSP 2100 and Motorola 56000 designs, and is even reflected in the MIPS R8000 processor FPU and IBM POWER architecture with its Branch Unit loop counter. The idea is to allow the separate parts to operate as independently as possible (for example, a memory access, pointer increment, and ALU operation), for the highest throughput, so instructions accessing loop and condition registers don't take the same path as data processing instructions.
Like the TMS320C30, the 96002 has a separate program memory (RAM in this case, with a bootstrap ROM used to load the initial external program) and two blocks of data RAM, each with separate data and address busses. The data blocks can also be switched to ROM blocks (such as sine and cosine tables). There's also a data bus for access to external memory. Separate units work independently, with their own registers (generally organised as three 32 bit parts of a single 96 bit register in the 96002, which is where the '96' comes from).
The program control unit has a register containing 32 bit PC, status, and operating mode registers, plus 32 bit loop address and 32 bit loop counter registers (branches are 2 cycles, conditional branches are 3 cycles - with conditional execution support), and a fifteen element 64 bit stack (with separate 6 bit stack pointer).
The address generation unit has eight 96 bit registers, divided into three 32 bit (24 in the 56000/1) registers - R0-R7 address, N0-N7 offset, and M0-M7 modify (containing increment values) registers.
The Data Unit includes ten 96-bit floating point/integer registers, grouped as two 96 bit accumulators (A and B = three 32 bit registers each: A2, A1, A0 and B2, B1, B0) and two 64 bit input registers (X and Y = two 32 bit registers each: X1, X0 and Y1, Y0). Input registers are general purpose, but allow new operands to be loaded for the next instruction while the current contents are being used (accumulators are 8+24+24 = 56 bit in the 56000/1, where the '56' comes from). The DSP96000 was one of the first to perform fully IEEE floating point compliant operations.
The processor is not pipelined, but designed for single cycle independent execution within each unit (actually this could be considered a three stage pipeline). With multiple units and the large number of registers, it can perform a floating point multiply, add and subtract while loading two registers, performing a DMA transfer, and four address calculations within a two clock tick processor cycle, at peak speeds.
It's very similar to the Analog Devices ADSP2100 series - the latter has two address units, but replaces the separate data unit with three execution units (ALU, a multiplier, and a barrel shifter).
The DSP56K and 680xx CPUs have been combined in one package (similar idea as the TMS320C8x) in the Motorola 68456.
The DSP56K was part of the ill-fated NeXT system, as well as the lesser known Atari Falcon (still made in low volumes for music buffs).
Although the TRON project produced processors competitive in performance (Fujitsu's(?) Gmicro/500 memory-data CPU (1993) was faster and used less power than a Pentium), the idea of a single standard processor never caught on, and newer concepts (such as RISC features) overtook the TRON design. Hitachi itself has supplied a wide variety of microprocessors, from Motorola and Zilog compatible designs to IBM System/360/370/390 compatible mainframes, but has also designed several of its own series of processors.
The Hitachi SH series was meant to replace the 8-bit and 16-bit H8 microcontrollers, a series of PDP-11-like (or National Semiconductor 32032/32016-like) memory-data CPUs with sixteen 16-bit registers (eight in the H8/300), usable as sixteen 8-bit or combined as eight 32-bit registers (for addressing, except H8/300), with many memory-oriented addressing modes. The SH is also designed for the embedded market, and is similar to the ARM architecture in many ways. It's a 32 bit processor, but with a 16 bit instruction format (different from Thumb, which is a 16 bit encoding of a subset of ARM 32 bit instructions, or the NEC V800 load-store series, which mixes 16 and 32 bit instruction formats), and has sixteen general purpose registers and a load/store architecture (again, like ARM). This results in a very high code density, with program sizes similar to the 680x0 and 80x86 CPUs, and about half that of the PowerPC. Because of the small instruction size, there is no load immediate instruction, but a PC-relative addressing mode is supported to load 32 bit values (unlike ARM or PDP-11, the PC is not otherwise visible). The SH also has a Multiply ACcumulate (MAC) instruction, and MACH/L (high/low word) result registers - 42 bit results (32 low, 10 high) in the SH1, 64 bit results (both 32 bit) in the SH2 and later. The SH3 includes an MMU and 2K to 8K of unified cache.
The SH4 (mid-1998) is a superscalar version with extensions for 3-D graphics support. It can issue two instructions at a time to any of four units: integer, floating point, load/store, branch (except for certain non-superscalar instructions, such as modifying control registers). Certain instructions, such as register-register move, can be executed by either the integer or load/store unit, so two can be issued at the same time. Each unit has a separate pipeline, five stages for integer and load/store, five or six for floating point, and three for branch.
Hitachi designers chose to add 3-D support to the SH4 instead of parallel integer subword operations like HP MAX, SPARC VIS, or Intel MMX extensions, which mainly enhance rendering performance, because they felt rendering can be handled more efficiently by a graphics coprocessor. 3-D graphics support is added by supporting the vector and matrix operations used for manipulating 3-D points (see Appendix D). This involved adding an extra set of floating point registers, for a total of two sets of sixteen - one set as a 4X4 matrix, the other a set of four 4-element vectors. A mode bit selects which to use as the foreground (register/vector) and background (matrix) banks. Register pair operations can load/store/move two registers (64 bits) at once. An inner product operation computes the inner product multiplication of two vectors (four simultaneous multiplies and one 4-input add), while a transformation instruction computes a matrix-vector product (issued as four consecutive inner product instructions, but using four internal work registers so intermediate results don't need to use data registers).
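The relationship between the two operations described above (a transformation being four consecutive inner products) can be shown with a short Python sketch. The function names are hypothetical, and the row-by-row matrix layout is an assumption made for illustration; it is not meant to describe how the SH4 register banks are actually arranged.

```python
def inner_product(u, v):
    """Four multiplies followed by one 4-input add."""
    p0, p1, p2, p3 = (u[i] * v[i] for i in range(4))
    return p0 + p1 + p2 + p3

def transform(matrix, vector):
    """Matrix-vector product computed as four consecutive inner products.

    The results are collected in a separate list first (standing in for
    the internal work registers), so the input vector is not overwritten
    while later rows still need it.
    """
    work = [inner_product(row, vector) for row in matrix]
    return work

# Example: translate the point (1, 2, 3) by (5, 6, 7) in homogeneous coordinates.
translation = [
    [1.0, 0.0, 0.0, 5.0],
    [0.0, 1.0, 0.0, 6.0],
    [0.0, 0.0, 1.0, 7.0],
    [0.0, 0.0, 0.0, 1.0],
]
moved = transform(translation, [1.0, 2.0, 3.0, 1.0])
```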
The SH4 allows operations to complete out of order under compiler control. For example, while a transformation is being executed (4 cycles) another can be stored (2 cycles using double-store instructions), then a third loaded (2 cycles) in preparation for the next transformation, allowing execution to be sustained at 1.4 gigaflops for a 200MHz CPU.
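One way to arrive at the 1.4 gigaflops figure is sketched below. It assumes that a transformation consists of four inner products, that each inner product counts as four multiplies plus three additions (the 4-input add counted as three 2-input adds), and that one transformation completes every 4 cycles. The counting convention is an assumption of this sketch, not something stated by Hitachi here.

```python
MULTIPLIES_PER_INNER_PRODUCT = 4
ADDS_PER_INNER_PRODUCT = 3        # 4-input add counted as three additions (assumption)
INNER_PRODUCTS_PER_TRANSFORM = 4
CYCLES_PER_TRANSFORM = 4
CLOCK_HZ = 200_000_000            # 200MHz

def sustained_flops(clock_hz: int = CLOCK_HZ) -> int:
    """Floating point operations per second with transformations issued back to back."""
    flops_per_transform = INNER_PRODUCTS_PER_TRANSFORM * (
        MULTIPLIES_PER_INNER_PRODUCT + ADDS_PER_INNER_PRODUCT
    )
    return flops_per_transform * clock_hz // CYCLES_PER_TRANSFORM
```

Under these assumptions each transformation is 28 operations in 4 cycles, or 7 per cycle, which at 200MHz gives 1,400,000,000 operations per second.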
The SH5 is expected to be a 64-bit version. Other enhancements also planned include support for MPEG operations, which are supported in the SPARC VIS instructions. The SH5 adds a set of eight branch registers (like the Intel/HP IA-64), and a status bit which enables pre-loading of the target instructions when an address is placed in a branch register.
The SH is used in many of Hitachi's own products, as well as being a pioneer of wide popularity for a Japanese CPU outside of Japan. It's most prominently featured in the Sega Saturn video game system (which uses two SH2 CPUs) and Dreamcast (SH4) and many Windows CE handheld/pocket computers (SH3 chip set).
Part XIII: Motorola MCore, RISC brother to ColdFire (Early 1998)
To fill a gap in Motorola's product line, in the low cost/power consumption field for which the PowerPC's complexity makes it impractical, the company designed a load/store CPU and core which contains features similar to the ARM, PowerPC, and Hitachi SH, beginning with the M200 (1997). Based on a four stage pipeline, the MCore contains sixteen 32-bit data registers, plus an alternate set for fast interrupts (like the ARM, which only has seven in the second set), and a separate carry bit (like the TMS 1000). It also has an ARM-like (and 8x300-like before it) execution unit with a shifter for one operand, a shifter/multiply/divide unit, and an integer ALU in a series. It defines a 16-bit instruction set like the Hitachi SH and ARM Thumb, and separates the branch/program control unit from the execution unit, as the PowerPC does. The PC unit contains a branch adder which allows branches to be computed in parallel with the branch instruction decode and execute, so branches only take two cycles (skipped branches take one). The M300 (late 1998) added floating point support (sharing the integer registers) and dual instruction prefetch.
The MCore is meant for embedded applications where custom hardware may be needed, so like the ARM it has coprocessor support in the form of the Hardware Accelerator Interface (HAI) unit which can contain custom circuitry, and the HAI bus for external components.
Part XIV: TI MSP430 series, PDP-11 rediscovered (late 1998?)
Texas Instruments has been involved with microcontrollers almost as long as Intel, having introduced the TMS1000 microcontroller shortly after the Intel 4004/4040. TI concentrated mostly on embedded digital signal processors (DSPs) such as the TMS320Cx0 series, and was involved in microprocessors mainly as the manufacturer of 32-bit and 64-bit Sun SPARC designs. The MSP430 series Mixed Signal Microcontrollers are 16-bit CPUs for low cost/power designs.
Called "RISC like" (and consequently obliterating all remaining meaning from that term), the MSP430 is essentially a simplified version of the PDP-11 architecture. It has sixteen 16-bit registers, with R0 used as the program counter (PC), and R1 as the stack pointer (SP) (the PDP-11 had eight, with PC and SP in the two highest registers instead of two lowest). R2 is used for the status register (a separate register in the PDP-11). Addressing modes are a small subset of the PDP-11, lacking auto-decrement and pre-increment modes, but including register indirect, making this a memory-data processor (little-endian). Constants are loaded using post-increment PC relative addresses like the PDP-11 (i.e. "@R0+"), but commonly used constants can be generated by reading from R2 or R3 (indirect addressing modes can generate 0, 1, 2, -1, 4, or 8 - different values for each register).
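The constant generator can be pictured as a small lookup keyed on the register number and the source addressing-mode bits. The sketch below follows the mapping given in TI's MSP430 family documentation for the two-bit source mode field; the table and function names are hypothetical, and combinations not listed behave as ordinary operands.

```python
# (register number, 2 bit source addressing mode) -> generated constant
CONSTANT_GENERATOR = {
    (2, 0b10): 4,
    (2, 0b11): 8,
    (3, 0b00): 0,
    (3, 0b01): 1,
    (3, 0b10): 2,
    (3, 0b11): -1,   # all ones, i.e. 0xFFFF as a 16 bit word
}

def generated_constant(reg: int, mode_bits: int):
    """Return the generated constant, or None for an ordinary operand."""
    return CONSTANT_GENERATOR.get((reg, mode_bits))
```

Because these constants come from the addressing mode rather than from an extra word following the instruction, using them saves both code space and a memory fetch.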
The MSP430 has fewer instructions than the PDP-11 (51 total, 27 core). Specifically, multiplication is implemented as a memory-mapped peripheral - two operands (8 or 16 bits) are written to the input ports, and the multiplication result can be read from the output (this is a form of Transport Triggered Architecture, or TTA). As a low cost microcontroller, multiple on-chip peripherals (in addition to the multiplier) are standard in many available versions.
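The transport-triggered idea (the operation happens as a side effect of moving data to a port, not because of a multiply opcode) can be modelled with a minimal Python sketch. The class name and port offsets are hypothetical, not the real MSP430 register addresses, and the sketch assumes an unsigned 16x16 multiply that is triggered by the write of the second operand.

```python
class MappedMultiplier:
    """Toy model of a memory-mapped multiplier peripheral."""

    OP1, OP2, RES_LO, RES_HI = 0, 2, 4, 6   # hypothetical port offsets

    def __init__(self):
        self.ports = {self.OP1: 0, self.OP2: 0, self.RES_LO: 0, self.RES_HI: 0}

    def write(self, port: int, value: int) -> None:
        self.ports[port] = value & 0xFFFF
        if port == self.OP2:
            # Writing the second operand is what triggers the multiply.
            product = self.ports[self.OP1] * self.ports[self.OP2]
            self.ports[self.RES_LO] = product & 0xFFFF
            self.ports[self.RES_HI] = (product >> 16) & 0xFFFF

    def read(self, port: int) -> int:
        return self.ports[port]

mul = MappedMultiplier()
mul.write(MappedMultiplier.OP1, 300)
mul.write(MappedMultiplier.OP2, 400)
result = (mul.read(MappedMultiplier.RES_HI) << 16) | mul.read(MappedMultiplier.RES_LO)
```

From the CPU's point of view this is just two stores and two loads, so no multiply instruction is needed in the instruction set.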
Future versions are expected to be available with two 4-bit segment registers (Code Segment Pointer for instructions, Data Page Pointer for data) to allow 20-bit memory addressing. Long branch and call instructions will be added as well.
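The arithmetic behind the 20-bit figure is simply 4 segment bits plus 16 offset bits. A minimal sketch, assuming the 4-bit register supplies the top four address bits directly (rather than an overlapping 80x86-style shifted segment); the function name is hypothetical and the real scheme may differ:

```python
def physical_address(segment: int, offset: int) -> int:
    """Combine a 4 bit segment value with a 16 bit offset into a 20 bit address."""
    return ((segment & 0xF) << 16) | (offset & 0xFFFF)
```

With this layout the sixteen possible segment values select sixteen non-overlapping 64K pages, for a total of 1M addressable locations.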