the Andrew Bailey

Registers of the x86 CPU architecture

I've never really looked closely at the individual registers upon which most of my computing is done. I stay comfortably above that stuff. But I've been curious of late, so I looked, and I got lost, but I don't regret it.

In the beginning, x86 was modeled after Intel's previous CPUs, which I will not talk about here. There are four sort-of general purpose 16-bit registers: accumulator AX, base BX, counter CX, and data DX. Their high and low bytes can be addressed directly (AH, AL, BH, BL, CH, CL, DH, DL). There are four index/pointer registers: stack pointer SP (the top of the stack), base pointer BP (usually the base of the current stack frame), source index SI (for arrays), and destination index DI (also arrays). Memory addresses are formed by combining a 16-bit offset with one of four segment registers: stack segment SS, code segment CS, data segment DS, and extra segment ES. The 386 later added two more extra segment registers, FS and GS. There is an instruction pointer register IP, which points to the next instruction to be executed. Programs cannot read or write it directly; they can only change it indirectly through jumps, calls, and returns. There is also a FLAGS register, in which each bit serves a different purpose, and which is far too complicated for the purposes of this article.
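The segment arithmetic is the part that trips people up, so here's a minimal Python sketch of how an 8086 in real mode turns a segment:offset pair into a 20-bit physical address. The function name is my own. The wraparound shown applies to the original 8086. Later CPUs can reach just past 1 MB when the A20 line is enabled.

```python
def real_mode_address(segment: int, offset: int) -> int:
    """Physical address an 8086 forms from a segment:offset pair.

    The 16-bit segment is shifted left by 4 bits (multiplied by 16) and added
    to the 16-bit offset. The 8086 only has 20 address lines, so a carry out
    of bit 19 is dropped and the address wraps to the bottom of memory.
    """
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must be 16-bit values")
    return ((segment << 4) + offset) & 0xFFFFF


# Different segment:offset pairs can name the same byte.
print(hex(real_mode_address(0xB800, 0x0000)))  # 0xb8000
print(hex(real_mode_address(0xB000, 0x8000)))  # 0xb8000
# 0xFFFF:0x0010 would be 0x100000, which wraps to 0 on an 8086.
print(hex(real_mode_address(0xFFFF, 0x0010)))  # 0x0
```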

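The 286 introduced protected mode and, with it, four system registers: global descriptor table register GDTR, local descriptor table register LDTR, interrupt descriptor table register IDTR, and task register TR. They are not all 16-bit. GDTR and IDTR hold the base address and size limit of their tables, while LDTR and TR hold 16-bit selectors that point into the global descriptor table. In protected mode, the segment registers also hold selectors rather than raw segment addresses.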
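In protected mode, a segment register (like LDTR and TR) holds a 16-bit selector instead of an address. The selector picks an entry from a descriptor table, and the table is located through GDTR or LDTR. As a rough illustration, this Python sketch splits a selector into its three fields: a 13-bit index, a table indicator bit, and a 2-bit requested privilege level. The field layout follows Intel's documentation as I understand it. The class and function names are made up for this sketch, and 0x2B is just an arbitrary example value.

```python
from typing import NamedTuple


class Selector(NamedTuple):
    index: int  # which 8-byte descriptor in the table
    table: str  # "GDT" or "LDT", chosen by the table indicator bit
    rpl: int    # requested privilege level, 0 (most privileged) to 3


def decode_selector(value: int) -> Selector:
    """Split a 16-bit protected-mode segment selector into its fields."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("a selector is a 16-bit value")
    return Selector(
        index=value >> 3,                          # bits 15..3
        table="LDT" if value & 0b100 else "GDT",   # bit 2
        rpl=value & 0b11,                          # bits 1..0
    )


sel = decode_selector(0x2B)
print(sel)            # Selector(index=5, table='GDT', rpl=3)
print(sel.index * 8)  # 40, the descriptor's byte offset inside its table
```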

When poring over old PC system requirements, you may have read something about x87, or floating point coprocessors. These chips (the 8087, 287, and 387) were essentially hardware accelerators for floating point math, in much the same way that 3D accelerators were later on. The x87 design added eight 80-bit registers, organized as a stack (ST(0) through ST(7)). These held only floating point values, whereas the other registers held only integers. Double precision floating point is 64-bit, but Intel created an 80-bit double extended type with a wider 64-bit significand, so that intermediate results would suffer less from rounding errors. By the 486 era (the 486DX, at least), x87 functionality was integrated onto the CPU itself; perhaps a sign of things to come.
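To make the 80-bit layout concrete, here's a minimal Python sketch that re-encodes an ordinary 64-bit double into the x87 double extended format. It only shows where the bits go and does no extended-precision arithmetic. The function name is my own, and it deliberately refuses subnormals, infinities, and NaNs to stay short.

```python
import struct


def to_x87_extended(x: float) -> bytes:
    """Encode a Python float (IEEE 754 double) as a 10-byte x87 extended value.

    Layout, least significant byte first (x86 is little-endian):
      bytes 0-7: 64-bit significand, with an explicit leading integer bit
      bytes 8-9: sign bit (bit 15) and 15-bit exponent (bias 16383)
    Only zeros and normal numbers are handled in this sketch.
    """
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    sign = bits >> 63
    exp = (bits >> 52) & 0x7FF        # 11-bit double exponent, bias 1023
    frac = bits & ((1 << 52) - 1)     # 52 stored fraction bits
    if exp == 0 and frac == 0:
        ext_exp, significand = 0, 0   # signed zero
    elif exp == 0 or exp == 0x7FF:
        raise ValueError("subnormals, infinities and NaNs are not handled here")
    else:
        ext_exp = exp - 1023 + 16383              # re-bias the exponent
        significand = (1 << 63) | (frac << 11)    # explicit 1, 52 bits, 11 zero bits
    return struct.pack("<QH", significand, (sign << 15) | ext_exp)


print(to_x87_extended(1.0).hex())   # 0000000000000080ff3f
print(to_x87_extended(-2.0).hex())  # 000000000000008000c0
```

The eleven zero bits at the bottom of the significand are the extra precision a double can't fill. The low eight bytes (the significand) are also the 64 bits that MMX later reuses.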

The 386 ushered in 32-bit computing for x86. With it, the eight general purpose and index/pointer registers were extended to 32 bits (EAX, EBX, ECX, EDX, EBP, ESP, ESI, EDI), as were the instruction pointer (EIP) and the flags register (EFLAGS). The E prefix stands for Extended. The old 16-bit names now refer to the lower 16 bits of these 32-bit registers (e.g. AX is the lower 16 bits of EAX), and AH and AL still name the two bytes of AX. The segment registers were not expanded and stayed at 16 bits.
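A quick way to see this nesting is to pull the narrower views out of a 32-bit value with shifts and masks. This Python sketch models reading the registers only. The function is hypothetical; real code would just name the register in an instruction. The same pattern holds for EBX, ECX, and EDX. ESP, EBP, ESI, and EDI have no byte-sized halves on the 386.

```python
def sub_registers(eax: int) -> dict[str, int]:
    """Return the narrower views of a 32-bit EAX value.

    AX is bits 0-15, AL is bits 0-7, and AH is bits 8-15.
    The upper 16 bits of EAX have no name of their own.
    """
    eax &= 0xFFFFFFFF
    return {
        "EAX": eax,
        "AX": eax & 0xFFFF,
        "AH": (eax >> 8) & 0xFF,
        "AL": eax & 0xFF,
    }


for name, value in sub_registers(0x12345678).items():
    print(f"{name:>3} = {value:#x}")
# EAX = 0x12345678
#  AX = 0x5678
#  AH = 0x56
#  AL = 0x78
```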

By this time, it was the mid-to-late 90s, and multimedia had become a big thing for computing. Chip designs had improved enough to process several things at a time, and a lot of multimedia data (pixels, audio samples) only needs 8 or 16 bits per value. An easy way to exploit this is to do the same operation to a whole array of small values at once: a Single Instruction operating on Multiple Data items (SIMD), if you will. The first of these extensions on x86 was MMX. It debuted on the Pentium MMX and was then included on every CPU from the Pentium II onwards. MMX added no new register storage. Existing operating systems already saved and restored the x87 registers when switching tasks, so they kept working unchanged. Instead, the lower 64 bits of the eight x87 registers received new names (MM0 through MM7). In the same way that AX is the lower 16 bits of EAX, MM5 occupies the lower 64 bits of ST(5). (Strictly speaking, MM5 maps to the physical x87 register R5, which is ST(5) once an MMX instruction has reset the top of the x87 stack to zero.) An obvious issue is that MMX only operates on integer data, not floating point. Because MMX and x87 share the same registers, mixing MMX with floating point code was a very bad idea if you wanted to get the most performance out of a CPU. Code also had to execute an EMMS instruction to empty the shared registers before going back to x87 math.
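To show what "the same operation on multiple data items" means inside one 64-bit MMX register, here's a Python sketch of the lane-by-lane byte add that MMX's PADDB instruction performs. It models the semantics only; real code would use the instruction itself or a compiler intrinsic.

```python
def paddb(a: int, b: int) -> int:
    """Model MMX PADDB: add eight packed bytes lane by lane.

    Each operand is a 64-bit integer holding eight independent 8-bit lanes.
    A carry out of one lane is discarded instead of spilling into the next
    lane, which is what makes this different from a plain 64-bit add.
    """
    result = 0
    for lane in range(8):
        shift = lane * 8
        x = (a >> shift) & 0xFF
        y = (b >> shift) & 0xFF
        result |= ((x + y) & 0xFF) << shift  # wrap around within the lane
    return result


a = 0x01_02_03_04_05_06_07_FF
b = 0x01_01_01_01_01_01_01_01
print(hex(paddb(a, b)))            # 0x203040506070800
print(hex((a + b) & (2**64 - 1)))  # 0x203040506070900 (carry leaked into lane 1)
```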

But then Intel realized that they had made a stupid decision with MMX, and introduced Streaming SIMD Extensions (SSE) with the Pentium 3. Not content with the registers they had, they added eight new 128-bit registers (XMM0 through XMM7), each holding four single precision floating point numbers. You could, for example, multiply the four numbers in XMM5 by the four numbers in XMM6 with a single instruction (MULPS), with the results overwriting XMM5. This was very handy for the matrix transformations used in 3D graphics at the time. SSE2 (released with the Pentium 4) allowed more data types to be placed in those XMM registers: two double precision floating point numbers, two 64-bit integers, four 32-bit integers, eight 16-bit integers, or sixteen bytes.
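An XMM register is just 128 bits whose meaning is chosen by the instruction that uses it. A Python model using the standard `struct` module shows both ideas from that paragraph. The first is a MULPS-style packed multiply (two operands, destination overwritten). The second is SSE2-style reinterpretation of the same 16 bytes. The `mulps` function is a hypothetical model, not a real API.

```python
import struct

# A 128-bit XMM register is just 16 bytes; the instruction decides how to read them.
# Model XMM5 and XMM6 as four packed single precision floats each.
xmm5 = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
xmm6 = struct.pack("<4f", 0.5, 0.5, 2.0, -1.0)


def mulps(dst: bytes, src: bytes) -> bytes:
    """Model SSE MULPS dst, src: multiply four float lanes, result replaces dst."""
    a = struct.unpack("<4f", dst)
    b = struct.unpack("<4f", src)
    return struct.pack("<4f", *(x * y for x, y in zip(a, b)))


xmm5 = mulps(xmm5, xmm6)
print(struct.unpack("<4f", xmm5))  # (0.5, 1.0, 6.0, -4.0)

# SSE2 lets the same 16 bytes be treated as other packed types.
for label, fmt in [("2 doubles", "<2d"), ("2 x int64", "<2q"),
                   ("4 x int32", "<4i"), ("8 x int16", "<8h"), ("16 bytes", "<16B")]:
    print(label, struct.unpack(fmt, xmm5))
```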

By this time, Intel was tired of having to share the x86 processor market with anyone else (like AMD). During the 90s they hatched a plan (which became Itanium) that, if it worked, would create a new 64-bit CPU market all to themselves. AMD had something different in mind. They borrowed Intel's old ideas and simply extended x86 to 64 bits. The existing general purpose registers were expanded to 64 bits (RAX, RBX, RCX, RDX, RBP, RSP, RSI, RDI), as were the instruction pointer (RIP) and the flags register (RFLAGS). I had guessed that the R prefix stands for Real, but I've never heard of a 64-bit integer being called a real. It is more commonly explained as simply meaning "register", matching the new numbered registers. The 32-bit names now refer to the lower 32 bits of these registers (e.g. EDX is the lower 32 bits of RDX). Eight new SSE registers were added (XMM8 through XMM15), as were eight new 64-bit general purpose registers (R8 through R15). The new registers can be suffixed with D, W, or B to access their lower 32, 16, or 8 bits, respectively (e.g. R14W is the lower 16 bits of R14). The low bytes of RSP, RBP, RSI, and RDI also finally got names (SPL, BPL, SIL, DIL). With wider registers, and more of them, AMD64 code can juggle far more data at once without going back to memory as often. This speeds up certain calculations like encoding and encryption.
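One detail worth knowing about these nested names: in 64-bit mode, writing to a 32-bit register such as EDX zero-extends the result into all 64 bits of RDX. Writing to DX or DL leaves the remaining bits alone. This Python class is a hypothetical model of those rules for a single register. It isn't how any real tool represents registers.

```python
class GeneralRegister:
    """Model one x86-64 general purpose register, e.g. RDX and its smaller names.

    Follows the AMD64 write rules: a 32-bit write zero-extends into the full
    64 bits, while 16-bit and 8-bit writes leave the other bits untouched.
    """

    def __init__(self) -> None:
        self.value = 0  # full 64-bit contents (RDX)

    def write64(self, v: int) -> None:   # e.g. mov rdx, imm64
        self.value = v & 0xFFFF_FFFF_FFFF_FFFF

    def write32(self, v: int) -> None:   # e.g. mov edx, imm32
        self.value = v & 0xFFFF_FFFF     # upper 32 bits become zero

    def write16(self, v: int) -> None:   # e.g. mov dx, imm16
        self.value = (self.value & ~0xFFFF) | (v & 0xFFFF)

    def write8(self, v: int) -> None:    # e.g. mov dl, imm8
        self.value = (self.value & ~0xFF) | (v & 0xFF)


rdx = GeneralRegister()
rdx.write64(0x1111_2222_3333_4444)
rdx.write16(0xBEEF)
print(hex(rdx.value))  # 0x111122223333beef (upper 48 bits kept)
rdx.write32(0xCAFE_F00D)
print(hex(rdx.value))  # 0xcafef00d (upper 32 bits cleared)
```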

So the 2000s drag on, and holy crap, we need to process more data more efficiently. So Intel created Advanced Vector Extensions (AVX), which called for yet more register expansion. The sixteen XMM registers were widened to 256 bits (YMM0-YMM15), with each XMM register occupying the lower 128 bits of the same-numbered YMM register. Again not satisfied, Intel followed close behind with AVX-512. It widens them again to 512 bits and doubles their number to thirty-two (ZMM0-ZMM31), with the same-numbered YMM and XMM registers occupying the lower 256 and 128 bits of each ZMM register. As of January 2014, no consumer CPUs have been released that implement AVX-512. If running in 32-bit mode, there are still only 8 vector registers, regardless of the AVX flavor the CPU supports. That raises the question: if you need to throw around this many numbers, why are you not in 64-bit mode already?
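The availability rules in that paragraph boil down to a small decision, sketched here in Python. The function and its extension labels are hypothetical. A real program would check the CPU's CPUID feature bits and whether the operating system has enabled the wider state.

```python
def vector_registers(long_mode: bool, extension: str) -> list[str]:
    """List the widest vector register names a program can use.

    extension is one of "SSE", "AVX", or "AVX-512" (labels made up for this sketch).
    32-bit code only ever sees registers 0-7; 64-bit code sees 0-15,
    or 0-31 with AVX-512.
    """
    prefix = {"SSE": "XMM", "AVX": "YMM", "AVX-512": "ZMM"}[extension]
    if not long_mode:
        count = 8
    elif extension == "AVX-512":
        count = 32
    else:
        count = 16
    return [f"{prefix}{i}" for i in range(count)]


print(len(vector_registers(False, "AVX-512")))  # 8
print(vector_registers(True, "AVX")[-1])        # YMM15
print(vector_registers(True, "AVX-512")[-1])    # ZMM31
```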

As if that wasn't enough, there are several more registers that are used for debugging, but I won't get into those. Modern x86 processors also execute instructions speculatively and out of order. To support this, they contain far more physical registers than the instruction set exposes, through a technique called register renaming. Suppose the CPU sees that an instruction overwrites a register with a value that does not depend on what is in it right now. It hands that instruction a spare physical register under the same name, so the old and new values can both be in flight at once.
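Here's a toy Python model of the renaming idea. It uses a table that maps architectural names like RAX to numbered physical registers and hands out a fresh one on every write. Real CPUs also recycle physical registers when instructions retire, rename the flags, and roll back after mispredicted branches. None of that is modeled here, and the class is hypothetical.

```python
class RenameTable:
    """Toy register renamer: maps architectural names to physical registers.

    Every instruction that writes a register gets a fresh physical register
    from the free list, so a later write to RAX does not have to wait for
    earlier readers of the old RAX value. Freeing registers is not modeled.
    """

    def __init__(self, arch_regs: list[str], num_physical: int) -> None:
        self.free = list(range(num_physical))
        # Start with each architectural register mapped to its own physical one.
        self.map = {name: self.free.pop(0) for name in arch_regs}

    def rename(self, dest: str, sources: list[str]) -> tuple[int, list[int]]:
        """Rename one instruction 'dest <- op(sources)'."""
        srcs = [self.map[s] for s in sources]  # read the current mappings first
        phys = self.free.pop(0)                # new home for the result
        self.map[dest] = phys
        return phys, srcs


rt = RenameTable(["RAX", "RBX", "RCX"], num_physical=8)
print(rt.rename("RAX", ["RBX"]))         # (3, [1])     mov rax, rbx
print(rt.rename("RCX", ["RCX", "RAX"]))  # (4, [2, 3])  add rcx, rax
print(rt.rename("RAX", []))              # (5, [])      mov rax, 42
```

The last instruction writes RAX into physical register 5 while the `add` can still read the previous RAX from physical register 3. Neither has to wait for the other.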

There seems to be a lot of subdividing going on, so let's recap this.

General purpose (integer) registers:

whole 64-bit* lower 32-bit^ lower 16-bit lower 8-bit
RAX EAX AX AH/AL
RBX EBX BX BH/BL
RCX ECX CX CH/CL
RDX EDX DX DH/DL
RSP ESP SP SPL*
RBP EBP BP BPL*
RSI ESI SI SIL*
RDI EDI DI DIL*
R8 R8D* R8W* R8B*
R9 R9D* R9W* R9B*
R10 R10D* R10W* R10B*
R11 R11D* R11W* R11B*
R12 R12D* R12W* R12B*
R13 R13D* R13W* R13B*
R14 R14D* R14W* R14B*
R15 R15D* R15W* R15B*
RFLAGS EFLAGS FLAGS
RIP EIP IP
SS
CS
DS
ES
FS^
GS^
GDTR+
LDTR+
IDTR+
TR+

*only available in 64-bit mode. ^only available on 386 and above. +only available on 286 and above. AH, BH, CH, and DH are bits 8-15 of their register, not the lowest byte.

Floating point x87 and MMX registers (x87 requires a separate math coprocessor before the 486DX; the lower 64-bit MMX names only on Pentium MMX, Pentium II, K6 and later):

whole 80-bit lower 64-bit
ST(0) MM0
ST(1) MM1
ST(2) MM2
ST(3) MM3
ST(4) MM4
ST(5) MM5
ST(6) MM6
ST(7) MM7

SSE/AVX registers (contains integers and floating point):

whole 512-bit+ lower 256-bit^ lower 128-bit`
ZMM0 YMM0 XMM0
ZMM1 YMM1 XMM1
ZMM2 YMM2 XMM2
ZMM3 YMM3 XMM3
ZMM4 YMM4 XMM4
ZMM5 YMM5 XMM5
ZMM6 YMM6 XMM6
ZMM7 YMM7 XMM7
ZMM8* YMM8* XMM8*
ZMM9* YMM9* XMM9*
ZMM10* YMM10* XMM10*
ZMM11* YMM11* XMM11*
ZMM12* YMM12* XMM12*
ZMM13* YMM13* XMM13*
ZMM14* YMM14* XMM14*
ZMM15* YMM15* XMM15*
ZMM16* YMM16+* XMM16+*
ZMM17* YMM17+* XMM17+*
ZMM18* YMM18+* XMM18+*
ZMM19* YMM19+* XMM19+*
ZMM20* YMM20+* XMM20+*
ZMM21* YMM21+* XMM21+*
ZMM22* YMM22+* XMM22+*
ZMM23* YMM23+* XMM23+*
ZMM24* YMM24+* XMM24+*
ZMM25* YMM25+* XMM25+*
ZMM26* YMM26+* XMM26+*
ZMM27* YMM27+* XMM27+*
ZMM28* YMM28+* XMM28+*
ZMM29* YMM29+* XMM29+*
ZMM30* YMM30+* XMM30+*
ZMM31* YMM31+* XMM31+*

*not available in 32-bit mode. +only with AVX-512 instructions. ^only on Sandy Bridge, Bulldozer architectures and later. `only on Pentium 3, Athlon XP and later.
