x86 CPU war
6th Generation CPU Comparisons.
by Paul Hsieh
AMD K6 Intel Pentium II Cyrix 6x86MX Common features Closing words Glossary

Update: 09/07/99 I have now started a seventh generation architecture web page.

The following is a comparative text meant to give people a feel for the differences in the various 6th generation x86 CPUs. For this little ditty, I've chosen the Intel P-II (aka Klamath, P6), the AMD K6 (aka NX686), and the Cyrix 6x86MX (aka M2). These are all MMX capable 6th generation x86 compatible CPUs; however, I am not going to discuss the MMX capabilities at all beyond saying that they all appear to have similar functionality. (MMX never really took off as the software enabling technology Intel claimed it to be, so it's not worth going into any depth on it.)

In what follows, I am assuming a high level of competence and knowledge on the part of the reader (basic 32 bit x86 assembly at least). For many of you, the discussion will be just slightly over your head. For those readers, I would recommend sitting through the 1 hour online lecture on the Post-RISC architecture by Charles Severance to get some more background on the state of modern processor technology. It is really an excellent lecture that is well worth the time.

Much of the following information comes from online documentation from Cyrix, AMD and Intel. I have played a little with Pentiums and Pentium-II's from work, as well as my AMD-K6 at home. I would also like to thank Dan Wax, Lance Smith and "Bob Instigator" from AMD, who corrected me on several points about the K6, and both Andreas Kaiser and Lee Powell, who provided insightful information and corrections gleaned from first-hand experience with these CPUs. Also, thanks to Terje Mathisen who pointed out an error, and Brian Converse who helped me with my grammar.

Comments welcome.


The AMD K6
The K6 architecture seems to mix some of the ideas of the P-II and 6x86MX architectures. AMD made trade-offs and decisions that they believed would deliver the maximum performance over all potential software. They have emphasized short latencies (like the 6x86MX), but the K6 translates x86 instructions into RISC operations that are queued in large instruction buffers and feed many (7 in all) independent units (like the P-II.) While they don't always have the best single implementation of any specific aspect, this was the result of conscious decisions that they believe strike a balance and hit a good performance sweet spot. Versus the P-II, they avoid really deep pipelining, which carries high penalties when the pipeline has to be backed out. Versus the Cyrix, the AMD is a fully post-RISC architecture, which is not as susceptible to pipeline stalls that artificially back up other stages.

General Architecture

The K6 has an extremely short and elegant pipeline. The AMD-K6 MMX Enhanced Processor x86 Code Optimization Application Note contains diagrams of it, showing fetch, decode, issue, operand fetch, execute and retire stages.

This seems remarkably simple considering the features that are claimed for the K6. The secret is that most of these stages do very complicated things. The light blue stages in those diagrams execute in an out of order fashion (and were colored by me, not AMD.)

The fetch stage is much like a typical Pentium instruction fetcher, and is able to present 16 cache-aligned bytes of data per clock. Of course this means that some instructions that straddle 16 byte boundaries will suffer an extra clock penalty before reaching the decode stage, much like they do on a Pentium. (The K6 is a little clever here: if there are enough partial opcode bytes for the predecoder to determine the instruction length, then the prefetching mechanism will fetch the next 16 byte buffer just in time to feed the remaining bytes to the issue stage.)

The decode stage attempts to simultaneously decode two simple x86 instructions, decode one long x86 instruction, and fetch one instruction's decoding from ROM. If both of the first two fail (usually only on rare instructions), the decoder is stalled for a second clock, which is required to completely decode the instruction from the ROM. If the first fails but the second does not (the usual case for instructions involving memory or an override), then a single instruction or override is decoded. If the first succeeds (the usual case when not involving memory or overrides), then two simple instructions are decoded. The decoded "OpQuad" is then entered into the scheduler.

Thus the K6's execution rate is limited to a maximum of two x86 instructions per clock. This decode stage decomposes the x86 instructions into RISC86 ops.

This last statement has been generally misunderstood in its importance (even by me!) Given that the P-II architecture can decode 3 instructions at once, it is tempting to conclude that the P-II can typically execute up to 50% faster than a K6. According to "Bob Instigator" (a technical marketroid from AMD) and "The Anatomy of a High-Performance Microprocessor: A Systems Perspective", this just isn't so. Besides the back-end limitations and scheduler problems that clog up the P-II, real world software traces analyzed at Advanced Micro Devices indicated that a 3-way decoder would have added almost no benefit while severely limiting the clock rate ramp of the K6, given its back end architecture.

That said, in real life the decode bandwidth limitation does crop up every now and then as a limiting factor, but it is rarely egregious in comparison to ordinary execution limitations.

The issue stage accepts up to 4 RISC86 instructions from the scheduler. The scheduler is basically an OpQuad buffer that can hold up to 6 clocks of instructions (which is up to 12 dual issued x86 instructions.) The K6 issues instructions subject only to execution unit availability using an oldest unissued first algorithm at a maximum rate of 4 RISC86 instructions per clock (the X and Y ALU pipelines, the load unit, and the store unit.) The instructions are marked as issued, but not removed until retirement.

The operand fetch stage reads the issued instruction operands without any restriction other than register availability. This is in contrast with the P-II, which can only read up to two retired register operands per clock (but is unrestricted in forwarding (unretired) register accesses.) The K6 uses some kind of internal "register MUX" which allows arbitrary accesses of internal and committed register space. If this stage "fails" because of a long data dependency, then according to the expected availability of the operands the instruction is either held in this stage for an additional clock or unissued back into the scheduler, essentially moving the instruction backwards through the pipeline!

This is an ingenious design that allows the K6 to perform "late" data dependency determinations without over-complicating the scheduler's issue logic. This clever idea gives a very close approximation of a reservation station architecture's "greedy algorithm scheduling".

The execution stages perform in one or two pipelined stages (with the exception of the floating point unit which is not pipelined, or complex instructions which stall those units during execution.) In theory, all units can be executing at once.

Retirement happens as completed instructions are pushed out of the scheduler (exactly 6 clocks after they are entered.) If, for some reason, the oldest OpQuad in the scheduler is not finished, scheduler advancement (which pushes out the oldest OpQuad and makes space for a newly decoded OpQuad) is halted until the OpQuad can be retired.

What we see here is a front end that starts fairly tight (two instructions) and a back end that ends somewhat wider (two integer execution units, one load, one store, and one FPU.) The reason for this seeming mismatch in execution bandwidth (as opposed to the Pentium, for example, which remains two wide from top to bottom) is that it allows the K6 to sustain varying execution loads as the dependency states change from clock to clock. This is at the very heart of what an out of order architecture is trying to accomplish; being wider at the back end is a natural consequence of this kind of design.

Branch Prediction

The K6 uses a very sophisticated branch prediction mechanism which delivers better prediction and fewer stalls than the P-II. There is an 8192-entry table of two-bit prediction entries which combines historic prediction information for any given branch with a heuristic that takes into account the results of nearby branches. Even branches that have somehow left the branch prediction table can still benefit from nearby branch activity data to help their prediction. According to published papers which studied these branch prediction implementations, this allows the K6 to achieve a 95% prediction rate versus the P-II's 90% prediction rate.

Additional stalls are avoided by using a 16-entry by 16-byte branch target cache which allows decode of the first target instruction to occur simultaneously with instruction address computation, rather than requiring (E)IP to be known and used to direct the next fetch (as is the case with the P-II.) This removes an (E)IP calculation dependency and an instruction fetch bubble. (This is a huge advantage in certain algorithms such as computing a GCD; see my examples for the code.) The K6 allows up to 7 outstanding unresolved branches (which seems like more than enough, since the scheduler only allows up to 6 issued clocks of pending instructions in the first place.)
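
The GCD code referred to above lives on a separate examples page; purely as a hedged illustration (not the author's original code), the kind of loop in question looks like the following, where nearly every iteration is a compare plus a taken branch, so branch target handling and predictor quality dominate the running time.

    /* A minimal sketch, not the original example from the examples page:
     * a subtraction-based GCD is almost nothing but compares and taken
     * branches, so its speed tracks how cheaply the CPU fetches and
     * predicts branch targets.  Assumes unsigned inputs.               */
    unsigned gcd(unsigned a, unsigned b)
    {
        if (a == 0) return b;
        if (b == 0) return a;
        while (a != b) {            /* loop-closing branch: taken       */
            if (a > b) a -= b;      /* data-dependent, hard to predict  */
            else       b -= a;
        }
        return a;
    }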

The K6 benefits additionally from the fact that it is only a 6 stage pipeline (as opposed to a 12 stage pipeline like the P-II) so even if a branch is incorrectly predicted it is only a 4 clock penalty as opposed to the P-II's 11-15 clock penalty.

One disadvantage pointed out to me by Andreas Kaiser is that misaligned branch targets still suffer an extra clock penalty, and that attempts to align branch targets can lead to branch target cache tag bit aliasing. This is a good point; however, it seems to me that you can mitigate it by hand-aligning only your innermost loop branch targets.

Another disadvantage (also pointed out to me by Andreas Kaiser) is that such a prediction mechanism does not work for indirect jump prediction (because the verification tables only compare a binary jump decision, not a whole address.) This is a bit of a bummer for C++ virtual functions.

Back of envelope calculation

This all means that the average loop penalty is:

(95% * 0) + (5% * 4) = 0.2 clocks per loop

But because the K6's decode bandwidth is limited, branch instructions take up precious decode slots. There are no branch execution clocks in most situations; however, branching instructions end up occupying a slot in which essentially no calculation is performed. In that sense K6 branches have a typical penalty of about 0.5 clocks. To combat this, the K6 executes the LOOP instruction in a single clock; however, this instruction performs so badly on Intel CPUs that no compiler generates it.
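
Just to spell the arithmetic out, here is a throwaway sketch using the figures quoted here and the P-II figures given later on this page:

    #include <stdio.h>

    /* Back-of-envelope branch cost: miss rate times mispredict penalty.
     * The figures are the ones quoted on this page, nothing more.     */
    int main(void)
    {
        double k6  = 0.05 * 4.0;     /* ~95% predicted, 4 clock penalty  */
        double pii = 0.10 * 13.0;    /* ~90% predicted, 11-15 clocks     */
        printf("K6   average loop penalty: %.1f clocks\n", k6);
        printf("P-II average loop penalty: %.1f clocks\n", pii);
        return 0;
    }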

Floating Point

The common high demand, high performance FPU operations (FADD, FSUB, FMUL) all execute with a throughput and latency of 2 clocks (versus 1 or 2 clock throughput and 3-5 clock latency on the P-II.) Amazingly, this means that the K6 can complete FPU operations faster than the P-II; however, it is worse on FPU code that is optimally scheduled for the P-II. As with the Pentium, Intel has worked hard in the P-II on fully pipelining the faster FPU operations, which works in their favor. Central to this is FXCH, which, in combination with FPU instruction operands, allows two new stack registers to be addressed by each binary FPU operation. The P-II allows FXCH to execute in 0 clocks -- the early revs of the K6 took two clocks, while later revs based on the "CXT core" can execute them in 0 clocks. Unfortunately, the P-II derives much more benefit from this since its FPU architecture allows it to decode and execute at a peak rate of one new FPU instruction on every clock.
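
In practice, the way to live with a 2 clock (K6) or 3-5 clock (P-II) FADD/FMUL latency is to keep independent work in flight rather than chaining every operation onto the previous result. A hedged C sketch of the idea follows (the compiler, not the programmer, ends up emitting whatever FXCH shuffling the stack needs):

    /* A minimal sketch: two independent accumulators let consecutive
     * FADD/FMUL chains overlap instead of each waiting out the other's
     * latency.  Function and variable names are illustrative only.    */
    double dot(const double *x, const double *y, int n)
    {
        double s0 = 0.0, s1 = 0.0;
        int i;
        for (i = 0; i + 1 < n; i += 2) {
            s0 += x[i]     * y[i];       /* dependency chain 0          */
            s1 += x[i + 1] * y[i + 1];   /* chain 1, independent of 0   */
        }
        if (i < n) s0 += x[i] * y[i];    /* odd leftover element        */
        return s0 + s1;
    }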

More complex instructions such as FDIV, FSQRT and so on will stall more of the units on the P-II than on the K6. However, since the P-II's scheduler is larger, it will be able to execute more instructions in parallel with the stalled FPU instruction (21 in all, though the port 0 integer unit is unavailable for the duration of the stalled FPU instruction), while the K6 can execute up to 11 other x86 instructions at full speed before needing to wait for the stalled FPU instruction to complete.

In a test I wrote (admittedly rigged to favor Intel FPUs) the K6 measured at only about 55% of the P-II's performance. (Update: using the K6-2's new SIMD floating point features, the roles have reversed -- the P-II can only execute at about 70% of a K6-2's speed.)

An interesting note is that FPU instructions on the K6 will retire before they completely execute. This is possible because it is only required that they work out whether or not they will generate an exception, and the execution state is reset on a task switch by the OS's built-in FPU state saving mechanism.

The state of floating point has changed so drastically recently that it's hard to make a definitive comment on this without a plethora of caveats. Facts: (1) the pure x87 floating point unit in the K6 does not compare favorably with that of the P-II; (2) this does not always show up in real life software, which may be built by bad compilers; (3) the future of floating point clearly lies with SIMD, where AMD has clearly established a leadership role; (4) Intel's advantage was primarily in software that was hand optimized by assembly coders -- but the roles have clearly reversed since the introduction of the K6-2.

Cache

The K6's L1 cache is 64KB, which is twice as large as the P-II's L1 cache. But it is only 2-way set associative (as opposed to the P-II, which is 4-way). This makes the replacement algorithm much simpler, but decreases its effectiveness for random data accesses. The increased size, however, more than compensates for the lower associativity. For code that works with contiguous data sets, the K6 simply offers twice the working set ceiling of the P-II.
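
Whatever the associativity, the working set ceiling is the one parameter a programmer can actually aim at. A hedged sketch of the usual approach, with an illustrative (not tuned) tile size:

    /* A minimal sketch of keeping the working set inside L1: do both
     * passes over each small tile before moving on, instead of two
     * full-array sweeps.  With the K6's 64K L1 the tile can be roughly
     * four times larger than with the P-II's 16K data cache before it
     * stops fitting.  BLOCK is illustrative, not a tuned constant.     */
    #define BLOCK 2048                        /* floats per tile (8KB)  */

    void scale_then_square(float *a, long n, float k)
    {
        long i, j, end;
        for (i = 0; i < n; i += BLOCK) {
            end = (i + BLOCK < n) ? i + BLOCK : n;
            for (j = i; j < end; j++) a[j] *= k;     /* pass 1: warms L1 */
            for (j = i; j < end; j++) a[j] *= a[j];  /* pass 2: L1 hits  */
        }
    }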

Like the P-II, the K6's cache is divided into two fixed caches for separate code and data. I am not as big a fan of split architectures (commonly referred to as the Harvard architecture) because they set an artificially low ceiling on your working sets. As pointed out to me by the AMD folk, this keeps them from having to worry about data accesses kicking out their instruction cache lines. But I would expect that to be dealt with by associativity, and I don't believe it is worth the trade-off of a lower working set ceiling.

Among the design benefits they do derive from a split architecture is that they can add pre-decode bits to just the instruction cache. On the K6, the predecode bits are used for determining instruction length boundaries. The address tags (which appear to work out to 9 bits) point to a sector which contains two 32 byte long cache lines, which (I assume) are selected by standard associativity rules. Each cache line has a standard set of status bits to indicate its accessibility state (obsolete, busy, loaded, etc.).

Although the K6's cache is non-blocking (allowing accesses to other lines even while a cache line miss is being processed), the K6's load/store unit architecture only allows in-order data access, so this feature cannot be taken advantage of on the K6. (Thanks to Andreas Kaiser for pointing this out to me.)

In addition, like the 6x86MX, the store unit of the K6 is actually buffered by a store queue. A neat feature of the store unit architecture is that it has two operand fetch stages -- the first for the address, and the second for the data, which happens one clock later. This allows stores of data that are being computed in the same clock as the store to occur without any apparent stall. That is so darn cool!

But perhaps more fundamentally, as AMD have said themselves, bigger is better, and at twice the P-II's size, I'll have to give the nod to AMD (though a bigger nod to the 6x86MX; see below.)

The K6 takes two (fully pipelined) clocks to fetch from its L1 cache within its load execution unit. Like the original P55C, the 6x86MX spends extra load clocks (i.e., address generation) during earlier stages of its pipeline. On the other hand, this compares favorably with the P-II, which takes three (fully pipelined) clocks to fetch from the L1 cache. What this means is that when walking a (cached) linked list (a typical data structure manipulation), the 6x86MX is the fastest, followed by the K6, followed by the P-II.
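
The list walk is worth seeing concretely, since it is the purest case of load latency mattering: every load's address comes from the previous load, so nothing overlaps and (for a cached list) the walk runs at roughly one node per load-use latency. A minimal, hedged sketch:

    /* A minimal sketch of the dependent load chain described above:
     * each iteration's address comes out of the previous iteration's
     * load, so a cached walk is paced by the L1 load-use latency
     * (2 clocks on the K6, 3 on the P-II).                            */
    struct node { struct node *next; int value; };

    int sum_list(const struct node *p)
    {
        int s = 0;
        while (p) {
            s += p->value;
            p = p->next;          /* next address depends on this load */
        }
        return s;
    }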

Update: AMD has released the K6-3 which, like the Celeron, adds a large on-die L2 cache. The K6-3's L2 cache is 256K, which is larger than the Celeron's at 128K. Unlike Intel, however, AMD has recommended that motherboards continue to include on-board L2 caches, creating what AMD calls a "TriLevel cache" architecture (I recall that an earlier Alpha based system did exactly the same thing.) Benchmarks indicate that the K6-3 has increased in performance between 10% and 15% over similarly clocked K6-2's! (Wow! I think I might have to get one of these.)

Other

Update: AMD's new "CXT Core" has enabled write combining.

As I have been contemplating the K6 design, it has really grown on me. Fundamentally, the big problem with x86 processors versus RISC chips is that they have too few registers and are inherently limited in instruction decode bandwidth due to natural instruction complexity. The K6 addresses both of these by maximizing the performance of memory based, in-cache data accesses to make up for the lack of registers, and by streamlining CISC instruction issue so that instructions are optimally broken down into RISC-like sub-instructions.

It is unfortunate that compilers favor Intel style optimizations. Basically there are several instructions and instruction mixes that compilers avoid due to their poor performance on Intel CPUs, even though the K6 executes them just fine. As an x86 assembly nut, it is not hard to see why I favor the K6 design over the Intel design.

Optimization

AMD, realizing that there is tremendous interest in code optimization for certain high performance applications, decided to write up some optimization documentation for the K6 (and now K6-2) processors. The documentation is fairly good about describing general strategies as well as giving a fairly detailed description for modelling the exact performance of code. This documentation far exceeds the quality of any of Intel's "Optimization AP notes", fundamentally because it is accurate and more thorough.

The reason I have come to this conclusion is that the architecture of the chip itself is much more straightforward than, say, the P-II's, and so less explanation is necessary. So the volume of documentation is not the only determining factor in measuring its quality.

If companies were interested in writing a compiler that optimized for the K6 I'm sure they could do very well. In my own experiments, I've found that optimizing for the K6 is very easy.

Recommendations I know of: (1) Avoid vector decoded instructions, including carry flag reading instructions and shld/shrd instructions, (2) use the LOOP instruction, (3) align branch targets and code in general as much as possible, (4) pre-load memory into registers early in your loops to work around the load latency issue (see the sketch below).
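
A hedged sketch of recommendation (4), with illustrative names; the point is simply to start the next load before its value is needed, so its latency overlaps useful work:

    /* A minimal sketch of pre-loading: fetch the next element one
     * iteration ahead so the 2 clock load latency overlaps the work on
     * the current element instead of stalling it.                      */
    int sum_scaled(const int *a, int n, int k)
    {
        int s = 0, i, cur, nxt;
        if (n <= 0) return 0;
        cur = a[0];                       /* prime the first load       */
        for (i = 0; i + 1 < n; i++) {
            nxt = a[i + 1];               /* load early for next pass   */
            s += cur * k;                 /* work overlaps that load    */
            cur = nxt;
        }
        return s + cur * k;               /* finish the last element    */
    }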

Brass Tacks

The K6 is cheap, supports Super Socket 7 (with a 100MHz bus), and has established itself very well in the marketplace, winning business from all the top tier OEMs (with the exception of Dell, which seems to have missed the consumer market shift entirely, and taken a serious step back from challenging Compaq's number one position.) AMD really changed the minds of people who thought the x86 market was pretty much an Intel deal (including me.)

Their marketing strategy of selling at a low price while adding features (cheaper Super7 infrastructure, SIMD floating point, 256K on chip L2 cache combined with motherboard L2 cache) has paid off in an unheard-of level of brand name recognition for a company other than Intel. Indeed, 3DNow! is a great counter to Intel Inside. If nothing else they helped create a real sub-$1000 PC market, and have dictated the price for retail x86 CPUs (Intel has been forced to drop even their own prices to unheard-of lows.)

AMD has struggled more to meet demand for new speed grades as they come online (they seem predictably optimistic), but overall they have been able to sell a boatload of K6's without being stepped on by Intel.

Previously in this section I maintained a small chronicle of AMD's achievements as the K6 architecture grew; however, we've gotten far beyond the question of "will the K6 survive?" (A question only idiots like Ashok Kumar still ask.) From the consumer's point of view, up until now (Aug 99) AMD has done a wonderful job. Eventually, they will need to retire the K6 core -- it has done its tour of duty. However, as long as Intel keeps the Celeron in the market, I'm sure AMD will keep the K6 in the market. AMD has a new core that they have just introduced into the market: the K7. This processor has many significant advantages over "6th generation architectures".

The real CPU WAR has only just begun ...

AMD performance documentation links

The first release of their x86 optimization guide is what triggered me to write this page. With it, I had documentation for all three of these 6th generation x86 CPUs. Unfortunately, they often elect to go with terse explanations that assume the reader is very familiar with CPU architecture and terminology. This led me to some misunderstandings on my initial reading of the documentation (I'm just a software guy.) On the other hand, the examples they give really help clarify the inner workings of the K6.

Update: The IEEE Computer Society has published a book called "The Anatomy of a High-Performance Microprocessor: A Systems Perspective" based on the AMD K6-2 microprocessor. It gives inner details of the K6-2 that I have never seen in any other documentation on microprocessors before. These details are a bit overwhelming for a mere software developer; however, for a hard core x86 hacker it's a treasure trove of information.


The Intel P-II
This was the first processor (that I knew of) to have completely documented post-RISC features such as dynamic execution, out of order execution and retirement. (PA-RISC predated it as far as implementing the technology goes, however; I am suspicious that HP told Intel to either work with them on Merced, or be sued up the wazoo.) This stuff really blew me away when I first read it. The goal is to allow longer latencies in exchange for high throughput (single cycle whenever possible.) The various stages attempt to issue/start instructions with the highest possible probability as often as possible, working out dependencies, register renaming requirements, forwarding, and resource contentions in later parts of the instruction pipe by means of speculative and out of order execution.

Intel has enjoyed the status of "de facto standard" in the x86 world for some time. Their P6/P-II architecture, while not delivering the same performance boost as previous generational increments, solidifies their position. It is the fastest, but it is also the most expensive of the lot.

General Architecture

The P-II is a highly pipelined architecture with an out of order execution engine in the middle. The Intel Architecture Optimization Manual diagrams it in two sections.

The two sections shown are essentially concatenated, showing 10 stages of in-order processing (since retirement must also be in-order) with 3 stages of out of order execution (RS, the Ports, and ROB write back colored in light blue by me, not Intel.)

Intel's basic idea was to break down the problem of execution into as many units as possible and to peel away every possible stall that was incurred by their previous Pentium architecture as each instruction marches forward down the assembly line. In particular, Intel invests 5 pipelined clocks to go from the instruction cache to a set of ready-to-execute micro-ops. (RISC architectures have no need for these 5 stages, since their fixed width instructions are generally already specified to make this translation immediate. It is these 5 stages that truly separate the x86 from ordinary RISC architectures, and Intel has essentially solved the problem with a brute force approach which costs them dearly in chip area.)

Each of Intel's "ports" is used as a feeding trough for micro-ops to various groupings of units as shown in the above diagram. So Intel's 5 micro-op per clock execution bandwidth is a little misleading, in the sense that two ports are required for any single store operation. It is more realistic to consider this equivalent to at most 4 K6 RISC86 ops issued per clock.

As a note of interest, Intel divides the execution and write back stages into two separate stages (the K6 does not, and there is really no compelling reason for the P6's method that I can see.)

Although it is not as well described, I believe that Intel's reservation station and reorder buffer combination serves substantially the same purpose as the K6's scheduler, and similarly that the retire unit acts on instruction clusters in exactly the same order as they were issued (CPUs are not otherwise known to have sorting algorithms wired into them.) Thus the micro-op throughput is limited to 3 per clock (compared with 4 RISC86 ops for the K6.)

So when everything is working well, the P-II can take 3 simple x86 instructions and turn them into 3 micro-ops on every clock. But, as can be plainly seen in their comments, they have a bizarre problem: they can only read two physical input register operands per clock (rename registers are not constrained by this condition.) This makes scheduling very complicated. Registers read for multiple purposes will not cost very much, and data dependencies don't suffer any more clocks than expected; however, the very typical trick of spreading calculations over several registers (used especially in loop unrolling) will upper bound the pipeline at two micro-ops per clock because of the physical register read bottleneck.

In any event, the decoders (which can decode up to 6 micro-ops per clock) are clearly out-stripping the later pipeline stages which are bottlenecked both by the 3 micro-op issue and two physical register read operand limit. The front end easily outperforms the back end. This helps Intel deal with their branch bubble, by making sure the decode bandwidth can stay well ahead of the execution units.

Something that you cannot see in the pictures above is the fact that the FPU is actually divided into two partitioned units: one for addition and subtraction and the other for all the other operations. This is found in the Pentium Pro documentation, and given the above diagram and the fact that this is not mentioned anywhere in the P-II documentation, I assumed that the P-II was different from the PPro in this respect (Intel's misleading documentation is really unhelpful on this point.) After I made some claims about these differences on USENET, an Intel engineer who claims to have worked on the PPro (and who must remain anonymous, since he attached a copyright statement insisting that I not copy anything he sent me -- and it made no mention of excluding his name) felt it his duty to point out that I was mistaken about this. In fact, he says, the PPro and P-II have an identical FPU architecture. So the P-II and PPro really are the same core design, with the exception of MMX, segment caching and probably some different glue logic for the local L2 caches.

This engineer also reiterated Intel's position on not revealing the inner workings of their CPU architectures, thus rendering it impossible for ordinary software engineers to know how to properly optimize for the P-II.

Branch Prediction

Central to facilitating the P-II's aggressive fire-and-forget execution strategy is full branch prediction. The functionality has been documented by Agner Fog, and it can track very complex patterns of branching. Intel has advertised a prediction rate of about 90% (based on academic work using the same implementation.) This prediction mechanism was also incorporated into the Pentium MMX CPUs. Unlike the K6's, the branch target buffer contains target addresses rather than instructions, and predictions only consider the current branch. This means an extra clock is required for taken branches in order to decode their branch target. Branches not in the branch target buffer are predicted statically (backward jumps taken, forward jumps not.) However, this "extra clock" is generally overlapped with execution clocks, and hence is not a factor except in short loops, or code loops with poorly translated code sequences (like compiled sprites.)
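
The static rule (backward taken, forward not taken) is one of the few prediction details source code can be arranged around: keep rare cases on forward branches and let loops close with backward branches. A hedged C sketch (most compilers already lay code out this way):

    /* A minimal sketch: the error test sits on a forward branch
     * (statically predicted not taken) and the loop closes with a
     * backward branch (statically predicted taken), so even a branch
     * the BTB has never seen starts out predicted correctly.           */
    int checked_sum(const int *a, int n)
    {
        int i, s = 0;
        for (i = 0; i < n; i++) {     /* closing branch: backward, taken */
            if (a[i] < 0)             /* rare case: forward, not taken   */
                return -1;
            s += a[i];
        }
        return s;
    }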

In order to do this in a sound manner, subsequent instructions must be executed "speculatively", under the proviso that any work done by them may have to be undone if the prediction turns out to be wrong. This is handled in part by renaming target write-back registers to shadow registers in a hidden set of extra registers. The K6 and 6x86MX have similar rename and speculation mechanisms, but with its longer latencies this is a more important feature for the P-II. The trade-off is that the process of undoing a mispredicted branch is slow (since the pipelines must completely flush), costing as much as 15 clocks (and no less than 11.) These clocks are non-overlappable with execution, of course, since the execution stream cannot be correctly known until the mispredict is completely processed. This huge penalty offsets the performance of the P-II, especially in code in which no P6/P-II optimization considerations have been made.

The P-II's predictor always deals with addresses (rather than boolean compare results as is done in the K6) and so is applicable to all forms of control transfer such as direct and indirect jumps and calls. This is critical to the P-II given that the latency between the ALUs and the instruction fetch is so large.

In the event of a conditional branch, both addresses are computed in parallel. But this just aids in making the prediction address ready sooner; there is no appreciable performance gained from having the mispredicted address ready early, given the huge penalty. The addresses are computed in an integer execution port (separate from the FPU), so branches are considered an ALU operation. The prefetch buffer is stalled for one clock until the target address is computed; however, since the decode bandwidth out-performs the execution bandwidth by a fair margin, this is not an issue for non-trivial loops.

Back of envelope calculation

This all means that the average loop penalty is:

(90% * 0) + (10% * 13) = 1.3 clocks per loop

This is obviously a lot higher than the K6 penalty. (The zero as the first penalty assumes that the loop is sufficiently large to hide the one clock branch bubble.)

For programmers this means one major thing: avoid mispredicted branches in your inner loops at all costs (make that 10% closer to 0%). Tables and conditional move instructions are common methods; however, since the predictor is used even for indirect jumps, there are branching situations where you have no choice but to suffer branch prediction penalties.
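
A hedged sketch of the conditional move idea (names are illustrative): a P6-aware compiler can turn the first form into CMOVcc, and the second form avoids the branch with pure arithmetic, so neither risks the roughly 13 clock mispredict.

    /* A minimal sketch of trading an unpredictable branch for straight
     * line code.  The ?: form is a candidate for CMOVcc; the masked
     * form needs no CMOV at all.  The mask trick assumes a - b does not
     * overflow and that >> on a negative int is an arithmetic shift
     * (true for the usual x86 compilers).                              */
    static int max_cmovable(int a, int b)
    {
        return (a > b) ? a : b;           /* compiler may emit CMOVcc    */
    }

    static int min_branchless(int a, int b)
    {
        int d = a - b;
        return b + (d & (d >> 31));       /* mask = -1 only when a < b   */
    }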

Floating Point

In keeping with their post-RISC architecture, the P-II has in some cases increased the latency of some FPU instructions over the Pentium for the sake of pipelining at high clock rates, with the idea that it hopefully will not matter if the code is properly scheduled. Intel says that FXCH requires no execution cycles, but does not explicitly state whether or not throughput bubbles are introduced. Other than latency, the P-II is very similar to the Pentium in terms of performance characteristics. This is because all FPU operations go through port 0 except FXCH, which goes to port 1, and the first stage of a multiply takes two non-pipelined clocks. This is pretty much identical to the P5 architecture.

The Intel floating point design has traditionally beaten the Cyrix and AMD CPUs on floating point performance, and this still appears to hold true, as tests with Quake and 3D Studio have confirmed. (The K6 is also beaten, but not by such a large margin -- and in the case of Quake II on a K6-2 the roles are reversed.)

The P-II's floating point unit is issued from the same port as one of the ALU units. This means that it cannot issue two integer and one floating point operation on every clock, and thus it is likely to be constrained to an issue rate similar to the K6's. As Andreas Kaiser points out, this does not necessarily preclude later execution clocks (for slower FPU operations, for example) from executing in parallel across all three basic math units (though this same comment applies to the K6).

As I mentioned above, the P-II's floating point unit is actually two units: one a fully pipelined add and subtract unit, and the other a partially pipelined complex unit (including multiplies.) In theory this gives greater parallelism opportunities over the original Pentium, but since the single port 0 cannot feed the units at a rate greater than 1 instruction per clock, the only value is design simplification. For most code, especially P5 optimized code, the extra multiply latency is likely to be the most telling factor.

Update: Intel has introduced the P-!!!, which is nothing more than a 500MHz+ P6 core with 3DNow!-like SIMD instructions. These instructions appear to be very similar in functionality and remarkably similar in performance to the 3DNow! instruction set. There are a lot of misconceptions about the performance of SSE versus 3DNow! The best analysis I've seen so far indicates that they are nearly identical, by virtue of the fact that Intel's "4-1-1" issue rate restriction holds back the mostly meaty 2 micro-op SSE instructions. Furthermore, each SSE instruction occupies twice as much of the execution units as a 3DNow! instruction, which nullifies the doubled output width. In any event, it's almost humorous to see Intel playing catch up to AMD like this. The clear winner: consumers.

Cache

The P-II's L1 cache is 32KB, divided into two fixed 16KB caches for separate code and data. These caches are 4-way set associative, which decreases thrashing versus the K6. But relatively speaking, this is quite small and inflexible when compared with the 6x86MX's unified cache. I am not a big fan of the P-II's smaller, less flexible L1 cache, and it appears as though Intel has done little to justify it being half the size of their competitors' L1 caches.

The greater associativity helps programs that are written indifferently with respect to data locality, but has no effect on code mindful of data locality (i.e., keeping their working sets contiguous and no larger than the L1 cache size.)

The P-II also has an "on-PCB L2 cache". What this means is that it does not need to use the motherboard bus to access its L2 cache. As such the communications interface can (and does) run at a much higher frequency; in current P-II's it is 1/2 the CPU clock rate. This is an advantage over the K6, K6-2 and 6x86MX CPUs, which access motherboard based L2 caches at only 66MHz or 100MHz. (However, the K6-III's on-die L2 cache runs at the full CPU clock rate, which is thus twice as fast as the P-II's.)

Other

Optimization

Intel has been diligent in creating optimization notes and even some interactive tutorials that describe how the P-II microarchitecture works. But the truth is that they serve as much as CPU advertisements as they do as serious technical material. As we found out with the Pentium CPU, Intel's notes were woefully inadequate for accurately characterizing and modelling its performance behaviour (this opened the door for people like Michael Abrash and Agner Fog to write up far more detailed descriptions based on observation rather than Intel's anemic documentation.) They contain egregious omissions, without giving a totally clear description of the architecture.

While they claim that hand scheduling has little or no effect on performance, experiments I and others have conducted have convinced me that this simply is not the case. In the few attempts I've made, using ideas I've recently been shown and studied myself, I can get between 5% and 30% improvement on very innocent looking loops via some very unintuitive modifications. The problem is that these ideas don't have any well described explanation -- yet.

With the P-II we find a nice dog and pony show, but again the documentation is inadequate for describing essential performance characteristics. They do steer you away from the big performance drains (branch misprediction and partial register stalls.) But in studying the P-II more closely, it is clear that there are lots of things going on under the hood that are not generally well understood. Here are some examples: (1) since the front end out-performs the back-end (in most cases), the "schedule on tie" situation is extremely important, but there is not a word about it anywhere in their documentation (Lee Powell puts this succinctly by saying that the P-II prefers 3X superscalar code to 2X superscalar code.) (2) The partial register stall appears to totally parallelize with other execution in some cases (the stall is less than 7 clocks), while not at all in others (a 7 clock stall in addition to ordinary clock expenditures.) (3) Salting execution streams with bogus memory read instructions can improve performance (I strongly suspect this has something to do with (1).)

So why doesn't Intel tell us these things so we can optimize for their CPU? The theory that they are just telling Microsoft or other compiler vendors under NDA doesn't fly since the kinds of details that are missing are well beyond the capabilities of any conventional compiler to take advantage of (I can still beat the best compilers by hand without even knowing the optimization rules, but instead just by guessing at them!) I can only imagine that they are only divulging these rules to certain companies that perform performance critical tasks that Intel has a keen interest in seeing done well running on their CPUs (soft DVD from Zoran for example; I'd be surprised if Intel didn't give them either better optimization documentation or actual code to improve their performance.)

Intel has their own compiler, which they have periodically advertised on the net as a plug-in replacement for the Windows NT based version of MSVC++, available for evaluation purposes (it's called Proton if I recall correctly). However, it is unclear to me how good it is, or whether anyone is using it (I don't use WinNT, so I did not pursue trying to get on the beta list). Update: I have been told that Microsoft and Inprise (Borland) have licensed Intel's compiler source and have been using it as their compiler base.

Recommendations I know of: (1) Avoid mispredicted branches (using conditional move can be helpful in removing unpredictable branches), (2) avoid partial register stalls (via the xor/sub reg,reg work-arounds, or the movzx/movsx instructions, or by simply avoiding the smaller registers; see the sketch below), (3) remove artificial "schedule on tie" stalls by slowing down the front end x86 decoder (i.e., issuing a bogus bypassed load by rereading the same address as the last memory read can often help; for more information read the discussion on the sandpile.org discussion forum), (4) align branch target instructions; they are not fed to the second prefetch stage fast enough to automatically align in time.
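
As a hedged sketch of recommendation (2) at the C level (names are illustrative): widening a byte through an unsigned type lets the compiler emit MOVZX, which writes the whole 32 bit register, instead of writing AL and later reading EAX, which is precisely the pattern that triggers the partial register stall.

    /* A minimal sketch: the (unsigned) widening encourages a MOVZX
     * byte load (full register write) rather than a MOVB into AL
     * followed by a read of EAX, which is the partial register stall
     * pattern on the P6 core.                                          */
    unsigned sum_bytes(const unsigned char *p, int n)
    {
        unsigned s = 0;
        int i;
        for (i = 0; i < n; i++)
            s += (unsigned)p[i];          /* byte load, zero extended    */
        return s;
    }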

Brass Tacks

When it comes right down to brass tacks though, the biggest advantage of their CPU is the higher clock rates that they have achieved. They have managed to stay one or two speed grades ahead of AMD. The chip also enjoys the benefit of the Intel Inside branding; Intel has spent a ton of money on brand name recognition to help lock in its success over competitors. Like the Pentium, the P-II still requires a lot of specific coding practices to wring the best performance out of it, and there is no doubt that many programmers will do this. Intel has gone to some great lengths to write tutorials that explain how to do this (regardless of their lack of correctness, they will give programmers a false sense of empowerment).

During 1998, the transition to super cheap PCs that consumers had been begging for for years finally took place. This is sometimes called the sub-$1000 PC market segment. Intel's P-II CPUs are simply too expensive (costing up to $800 alone) for manufacturers to build compelling sub-$1000 systems with them. As such, Intel has watched AMD and Cyrix pick up unprecedented market share.

Intel made a late foray into the sub-$1000 PC market; their whole business model did not support such an idea. Intel's "value consumer line", the Celeron, started out as an L2-cacheless piece of garbage architecture (read: about the same speed as P55Cs at the same clock rates), then switched to an integrated L2 cache architecture (stealing the K6-3's thunder). Intel was never really able to recover from the bad reputation that stuck to the Celeron, but perhaps that was their intent all along. It is now clear that Intel is basically dumping Celerons in an effort to wipe out AMD and Cyrix, while trying to maintain their hefty margins in their Pentium-II line. For the record, there is little performance difference between a Pentium-II and a Celeron, and the clock rates for the Celeron were being kept artificially low so as not to eat into their Pentium line. This action alone has brought a resurgence in the "over clocking game" that some adventurous power users like to get into.

But Intel being Intel, they have managed to seriously dent what was for a while exclusively an AMD and Cyrix market. Nevertheless, since the "value consumer" market has been growing so strongly, AMD and Cyrix have still been able to increase their volumes even with Intel's encroachment.

The P-II architecture is getting long in the tooth, but Intel keeps insisting on pushing it (demonstrating an uncooled 650MHz sample in early 1999.) Mum's the word on Intel's seventh generation x86 architecture (Willamette or Foster), probably because that architecture is not scheduled to be ready before late 2000. This old 6th generation part may prove to be easy pickings for Cyrix's Jalapeno and AMD's K7, both of which will be available in the second half of 1999.

Intel performance documentation links

While Intel does have plenty of documentation on their web site, they quite simply do not sit still with their URLs. It is impossible to keep track of these URLs, and I suspect Intel keeps changing their URLs based on some ulterior motive. All I can suggest is: slog through their links starting at the top. I have provided a link to Agner Fog's assembly page where his famous Pentium optimization manual has been updated with a dissection of the P-II.
The Cyrix 6x86MX
Originally codenamed M2, this chip was very eagerly awaited. Cyrix had been making grand claims about their superior architecture, and this was echoed by Cyrix's small but loyal following. Indeed it did come out, and indeed it is a very interesting architecture, but to say the least, it was very late. Both AMD and Intel had announced their 6th generation MMX processors and enjoyed a couple of months of extra revenue before Cyrix appeared on the scene. But as always, these Cyrix CPUs hit the street at a substantially lower cost than their rivals, although recent price cuts from AMD may be marginalizing them at the high end.

The primary microarchitecture difference of the 6x86MX CPU versus the K6 and P-II CPUs is that it still does native x86 execution rather than translation to internal RISC ops.

General Architecture

The Cyrix remains a dual-piped superscalar architecture; however, it has significant improvements over the Pentium design:

By being able to swap the instructions between pipes, there is no concern about artificial stalls due to scheduling an instruction into the wrong pipeline. By introducing two address generation stages, they eliminate the all too common AGI stall that is seen in the Pentium. The 6x86MX relies entirely on up front dependency resolution via register renaming and data forwarding; it does not buffer instructions in any way. Thus its instruction issue performance becomes bottlenecked by dependencies.

The out of order nature of the execution units is not very well described in Cyrix's documentation, beyond saying that slower instructions will make way for faster instructions. Hence it is not clear what the execution model really looks like.

Branch Prediction

The Cyrix CPU uses a 512-entry, 2-bit predictor, and this does not have a prediction rate that rivals either the P-II or K6 designs. However, both sides of a branch will have their first instruction decoded simultaneously in the same clock. In this way, the Cyrix hedges its bets so that it doesn't pay such a severe performance penalty when its prediction goes wrong. Beyond this, it appears as though Cyrix has adopted the post-RISC approach of a branch predict and speculative execution model. This fits nicely with the aggressive register renaming and data forwarding model from the original 6x86 design. Because of potential FPU exceptions, all FPU instructions are treated the same way as predicted branches. I would expect the same to be true of the P-II, but Intel has not documented this, whereas Cyrix has.

They have a fixed scheme of 4 levels of speculation, which are simply incremented for every new speculative instruction issued (this is somewhat lower than the P-II and K6, which can have 20 or 24 live instructions at any one given time, and somewhat more outstanding branches.)

The 6x86MX architecture is more lock stepped than the K6, and as such its issue follows its latency timings more closely. Specifically, the decode, issue and address generation stages are executed in lock step, with any stalls from resource contentions, complex decoding, etc., backing up the entire instruction fetch stage. However, their design makes it clear that they do everything possible to resolve these resource contentions as early as possible. This is to be contrasted with the K6 design, which is not lock step at all, but which, due to its late resource contention resolution, may end up re-issuing instructions after wasting an extra clock that it didn't need to in its operand fetch stage.

Floating Point

The 6x86MX has significantly slower floating point. The Cyrix's FADD, FMUL, FLD and FXCH instructions all take at least 4 clocks, which puts it at one quarter of the P-II's peak FPU execution rate. The 6x86MX (and even the older 6x86) tries to make up for this by having an FPU instruction FIFO. This means that most of the FPU clocks can be overlapped with integer clocks, and that a handful of FPU operations can be in flight at the same time, but in general it requires hand scheduling and relatively light use of the FPU to maximally leverage it. Oddly enough, their FDIV and FSQRT performance is about as good as, if not better than, the P-II implementation. This seems like an odd design decision, as optimizing FADD, FLD, FXCH and FMUL is clearly of much higher importance.
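
In practice, "hand scheduling and relatively light use of the FPU" mostly means making sure there is independent integer work available to run underneath each 4 clock FPU operation while the FIFO holds the pending FPU work. A hedged C sketch (names are illustrative):

    /* A minimal sketch: the FMUL/FADD chain sits in the FPU FIFO while
     * the independent integer bookkeeping (the histogram update and the
     * loop control) runs underneath its 4 clock latencies.              */
    double weighted_sum(const double *x, const int *bin, int *hist, int n)
    {
        double s = 0.0;
        int i;
        for (i = 0; i < n; i++) {
            s += x[i] * 0.5;          /* FPU work: FMUL then FADD        */
            hist[bin[i]]++;           /* integer work overlaps the FPU   */
        }
        return s;
    }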

Like AMD, Cyrix designed the 6x86MX floating point unit around the weak FPU code that x86 compilers generate. But, personally, I think Cyrix has gone way too far in ignoring FPU performance. Compilers only need to get a tiny bit better for the difference between the Cyrix and the Pentium II to be very noticeable on FPU code.

Cache

The Cyrix's cache design is unified at 64KB, with a separate extra 256 byte instruction buffer. I prefer this design to the K6's and P-II's split code and data architecture, since it better takes into account the different dynamics of code usage versus data usage that you would expect across varied software designs. As an example, p-code interpreters, or interpreted languages in general, would be expected to benefit more from a larger data cache. The same would apply to multimedia algorithms, which tend to apply simple transformations to large amounts of data (though in truth, your system benefits more from coprocessors for this purpose.) As another example, highly complex (compiled) applications that weave together the resources of many code paths (web browsers, office suite packages, and pre-emptive multitasking OSes in general) would prefer to have larger instruction caches. At both extremes, the Cyrix has twice the cache ceiling.

Thus the L1 cache becomes a sort of L2 cache for the 256 byte instruction line buffer, which allows the Cyrix design to confine complications such as predecode bits to a much smaller cache structure, and to use the unified L1 cache more efficiently as described above. Although I don't know the details, the prefetch units could try to see a cache miss coming and pre-load the instruction line cache in parallel with ordinary execution; this would compensate for the instruction cache's unusually small size, I would expect to the point of making it a moot point.

Other

Optimization

Cyrix's documentation is not that deep, but I get the feeling that neither are their CPUs. In particular, they do not describe their out of order mechanism in sufficient detail to even evaluate it. Not having tried to optimize for a Cyrix CPU myself, I don't have enough data points to really evaluate how lacking the documentation is. But it does appear that Cyrix is somewhat behind both Intel and AMD here.

It is my understanding that Cyrix commissioned Green Hills to create a compiler for them; however, I have not encountered, or even heard of, any code produced by it. Perhaps the MediaGX drivers are built with it.

Update: I've been recently pointed at Cyrix's Appnotes page, in particular note 106 which describes optimization techniques for the 6x86 and 6x86MX. It does provide a lot of good suggestions which are in line with what I know about the Cyrix CPUs, but they do not explain everything about how the 6x86MX really works. In particular, I still don't know how their "out of order" mechanism works.

It is very much like Intel's documentation, which just tells software developers what to do without giving complete explanations as to how the CPU works. The difference is that it's much shorter and more to the point.

One thing that surprised me is that the 6x86MX appears to have several extended MMX instructions! So in fact, Cyrix had actually beaten AMD (who did so with the K6-2) to nontrivially extending the x86 instruction set; they just didn't do a song and dance about it at the time. I haven't studied them yet, but I suspect that when Cyrix releases their 3DNow! implementation they should be able to advertise the fact that they will be supplying more total extensions to the x86 instruction set, with all of them being MMX based.

Brass Tacks

The 6x86MX design clearly has the highest instructions processed per clock on most ordinary tasks (read: WinStone.) I have been told various explanations for it (4-way 64K L1 cache, massive TLB cache, very aggressive memory strategies, etc), but without a real part to play with, I have not been able to verify this on my own.

Well, whatever it is, Cyrix learned an important lesson the hard way: clock rate is more important than architectural performance. Besides forcing Cyrix to stay in the "PR" labelling game, their clock scalability could not keep up with either Intel or AMD. Cyrix did not simply give up, however. Faced with a quickly dying architecture, a market shared with IBM, as well as an unsuccessful first foray into integrated CPUs, Cyrix did the only thing they could do -- drop IBM, get foundry capacity from National Semiconductor and sell the 6x86MX at rock bottom prices into the sub-$1000 PC market. Indeed, here they remained out of reach of both Intel and AMD, though they were not exactly making much money with this strategy.

Cyrix's acquisition by National Semiconductor has kept their future processor designs funded, but Cayenne (a 6x86MX derivative with a faster FPU and 3DNow! support) has yet to appear. Whatever window of opportunity existed for it has almost surely disappeared if it cannot follow the clock rate curve of Intel and AMD. But the real design that we are all waiting for is Jalapeno. National Semiconductor is making a very credible effort to ramp up its 0.18 micron process, and may beat both Intel and AMD to it. This would launch Jalapeno at speeds in excess of 600MHz with the "PR" shackles removed, which should allow Cyrix to become a real player again.

Update: National has buckled under the pressure of keeping the Cyrix division alive (unable to produce CPUs with high enough clock rates) and has sold it off to VIA. How this affects Cyrix's ability to reenter the market and release next generation products remains to be seen.

Cyrix performance documentation links


Common Features
The P-II and K6 processors require in-order retirement (for the Cyrix, retirement has no meaning; it uses its 4 levels of speculation to retain order.) This can be reasoned out simply from x86 architectural constraints. Specifically, in-order retirement is required to resolve resource contention properly.

Within the scheduler the order of the instructions is maintained. When a micro-op is ready to retire it is marked as such. The retire unit then waits for the micro-op blocks that correspond to x86 instructions to become entirely ready for retirement and removes them from the scheduler simultaneously. (In fact, the K6 retains blocks corresponding to all the RISC86 ops scheduled in a given clock, so that one or two x86 instructions might retire per clock. The Intel documentation is not as clear about its retirement strategies.) As instructions are retired, the non-speculative CS:EIP is updated.

The speculation aspect is the fact that the branch target of a branch prediction is simply fed to the prefetch immediately before the branch is resolved. A "branch verify" instruction is then queued up in place of the branch instruction and if the verify instruction checks out then it is simply retired (with no outputs except possibly to MSRs) like any ordinary instruction, otherwise a branch misprediction exception occurs.

Whenever an exception occurs (including branch mispredicts, page fault, divide by zero, non-maskable interrupt, etc) the currently pending instructions have to be "undone" in a sense before the processor can handle the exception situation. One way to do this is to simply rename all the registers with the older values up until the last retirement which might be available in the current physical register file, then send the processor into a kind of "single step" mode.

According to Agner Fog, the P-II retains fixed architectural registers which are not renamable and are only updated upon retirement. This would provide a convenient "undo" state. This also jells with the documentation, which indicates that the P-II can only read at most two architectural registers per clock. The K6 does not appear to be similarly stymied, although it too has fixed architectural registers.

The out of orderedness is limited to execution and register write-back. The benefits of this are mostly in mixing multiple instruction micro-op types so that they can execute in parallel. It is also useful for mixing multi-clock, or otherwise high-latency instructions with low-clock instructions. In the absence of these opportunities there are few advantages over previous generation CPU technologies that aren't taken care of by compilers or hand scheduling.

Contrary to what has been written about these processors, however, hand tuning of code has not become unnecessary. In particular, the Intel processors still handle carry flag based computation very well even though compilers do not exploit it; the K6 has load latencies; all of these processors still have alignment issues; and the K6 and 6x86MX prefer the LOOP instruction, which compilers do not generate. XCHG is also still the fastest way to swap two integer registers on all these processors, but compilers continue to avoid that instruction. Many of the exceptional cases (partial register stalls, vector decoding, etc.) are also unknown to most modern compilers.
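
For what it's worth, a hedged sketch of the XCHG point using GCC-style inline assembly (compilers of this era swap through a temporary with three MOVs instead):

    /* A minimal sketch, GCC-style inline assembly: a register-register
     * XCHG swaps the two values in one instruction, which none of the
     * common compilers will emit on their own.                          */
    static void swap_u32(unsigned *a, unsigned *b)
    {
        unsigned x = *a, y = *b;
        __asm__("xchgl %0, %1" : "+r"(x), "+r"(y));
        *a = x;
        *b = y;
    }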

In the past, penalties for cache misses, instruction misalignment and other hidden side-effects were more or less ignored. This is because on older architectures they hurt you no matter what, with no opportunity for instruction overlap, so the rule of avoiding them as much as possible was more important than knowing the precise penalty. With these architectures it is important to know how much code can be parallelized with these cache misses. Issues such as PCI bus, chip set and memory performance will have to be more closely watched by programmers.

The K6's documentation was the clearest about its cache design, and indeed it does appear to have a lot of good features. Their predecode bits are used in a very logical manner (which appears to buy the same thing that the Cyrix's instruction buffer buys them) and they have stuck with the simple to implement 2-way set associativity. A per-cache line status is kept, allowing independent access to separate lines.


Final words
With out of order execution, all these processors appear to promise the programmer freedom from the complicated scheduling and optimization rules of previous generation CPUs. Just write your code in whatever manner pleases you and the CPU will take care of making sure it all executes optimally for you. And depending on who you believe, you can easily be led to think this.

While these architectures are impressive, I don't believe that programmers can take such a relaxed attitude. There are still simple rules of coding that you have to watch out for (partial register stalls and 32 bit coding, for example) and there are other hardware limitations (at most 4 levels of speculation, a 4-deep FPU FIFO, etc.) that still require care on the part of the programmer in search of the highest levels of performance. I also hope that the argument that what these processors are doing is too complicated for programmers to model dies down as these processors become better understood.

Some programmers may mistakenly believe that the K6 and 6x86MX processors will fade away due to market dominance by Intel. I really don't think this is the case, as my sources tell me that AMD and Cyrix are selling every CPU they make, as fast as they can make them. The demand is definitely there. 3Q97 PC purchases indicated unusually strong sales for PCs at $1000 or less (dominated by Compaq machines powered by the Cyrix CPU), making up about 40% of the market.

The astute reader may notice that there are numerous features that I did not discuss at all. While it's possible that this is an oversight, I have also intentionally left out discussion of features that are common to all these processors (data forwarding, register renaming, call-return prediction stacks, and out of order execution, for example.) If you are pretty sure I am missing something that should be told, don't hesitate to send me feedback.

Update: Centaur, a subsidiary of IDT, has introduced a CPU called the WinChip C6. A brief reading of the documentation on their web site indicates that it's basically a single pipe 486 with a 64K split cache, dual MMX units, some 3D instruction extensions and generally more RISCified instructions. From a performance point of view their angle seems to be that the simplicity of the CPU will allow a quick ramp up in clock rate. Their chip has been introduced at 225 and 240MHz initially (available in Nov 97), with an intended ramp up to 266 and 300MHz by the second half of 1998. They are also targeting low power consumption and small die size, with an obvious eye towards the laptop market.

Unfortunately, even in the test chosen by them (WinStone; which is limited by memory, graphics and hard disk speed as much as CPU), they appear to score only marginally better than the now obsolete Pentium with MMX, and worse than all other CPUs at the same clock rate. These guys will have to pick their markets carefully and rely on good process technology to deliver the clock rates they are planning for.

Update: They have since announced the WinChip 2, which is superscalar and which they expect to have far superior performance. (They claim that they will be able to clock them between 400 and 600MHz.) We shall see; and we shall see if they explain their architecture in greater depth.

Update: 05/29/98 RISE (a startup technology company) has announced that they too will introduce an x86 CPU; however, they are keeping a tight lid on their architecture.


Glossary of terms

Links
JC's PC News'n'Links
Ace's Hardware
The seventh generation of CPUs

Updated 07/10/99
Copyright © 1997, 1998, 1999 Paul Hsieh All Rights Reserved.