Computer Architecture

Summary

Introduction

Main coursebook : Computer Organization and Design, David A. Patterson and John L. Hennessy, ed. Morgan Kaufmann.

80% of the marks will come from the exam, 20% from the assignments. For the assignments you can work either alone or in groups of two.

Assignments :

  1. Code two algorithms, one given, the other of your choice, with 5 different assembler architectures given by the teacher : Accumulator machine, Indexed machine, Load/Store machine, Memory-Memory machine, and Stack machine. Note that the assembler you will use is a 1-pass assembler.
  2. Create a small MIPS virtual machine. Any language can be used (C is better).
  3. Write a tech report on a CPU or Bus architecture.

What we're going to learn here is how computers work, and how to analyse their performance. We will focus on modern CPU issues such as pipelining, and talk about computer software and hardware optimization and abstraction.

Instruction Set Architecture : the interface between hardware and low-level software (an abstraction layer), which tries to standardize instruction sets. 80x86, PPC, ARM and MIPS are ISAs. Java bytecode is a virtual ISA ! The problem is when you want to make the ISA evolve (see the 486-Pentium issues...).

Performance : technically, it is a measure of how much work the computer does in a given length of time. Problem : what criteria should one select ?

For computers, the answer is simple : TIME. We first think of response time (latency) or throughput (number of tasks completed per unit of time), but the only real performance measurement is provided by execution time.

This latter can be subdivided into two categories : user CPU time (the time spent executing the program itself) and system CPU time (the time spent in the operating system on behalf of the program).

As it is difficult to measure system time accurately, we mostly focus here on user CPU time.

For a given X program on a machine :
Performance(X) = 1 / Execution_Time(X)

And to say "program X is N times faster than program Y" :
Perf.(X) / Perf.(Y) = N

Clock cycles : they are the basic unit of measurement. Each CPU instruction takes a certain number of clock cycles to execute. MUL instructions are known to be slower than ADD ones, memory accesses slower than register accesses... The notion of CPI (Cycles Per Instruction) captures that fact.
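
To make the relation concrete, here is a small C sketch of the classic equation CPU_time = instruction_count x CPI x clock_cycle_time (the function name and the numbers are made up for illustration) :

#include <stdio.h>

/* Classic relation : CPU time = instruction count * CPI * clock cycle time.
   Illustrative sketch; the numbers below are invented. */
static double cpu_time(double instr_count, double cpi, double clock_period_s)
{
    return instr_count * cpi * clock_period_s;
}

int main(void)
{
    double t_x = cpu_time(1e9, 2.0, 1e-9);  /* hypothetical program X */
    double t_y = cpu_time(1e9, 1.5, 2e-9);  /* hypothetical program Y */

    /* Performance(X) = 1 / Execution_Time(X); X is N times faster than Y
       when Perf(X) / Perf(Y) = t_y / t_x = N. */
    printf("X: %.3f s, Y: %.3f s, X is %.2fx faster\n", t_x, t_y, t_y / t_x);
    return 0;
}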

To increase the performance of a system, we can either : reduce the number of instructions executed, reduce the CPI, or reduce the clock cycle time (i.e. increase the clock frequency).

We talk of MIPS (Million Instructions Per Second) to express the average execution rate of a system. Notice this measurement is not adequate for benchmarking (because a million NOPs will go much faster than a million MULs...).

Linux used to show "bogoMIPS", which represents how many millions of empty loops per second the system can do (bogo = bogus ?).

A benchmark is a standard set of workloads. Small benchmarks are good for designers, but don't give an accurate representation of overall system performance. By contrast, benchmark suites such as SPEC (System Performance Evaluation Cooperative), an agreement among corporations on real programs and inputs, are more interesting, even if they can be abused.

From Amdahl's Law :
Execution_Time_After_Improvement = Execution_Time_Unaffected + (Execution_Time_Affected / Amount_Of_Improvement)
interesting design principles were derived, such as "make the common case fast" (RISC CPUs try to do that).
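
As a quick illustration in C (the 100 s / 80 s / 10x split below is an invented example, not from the course) :

#include <stdio.h>

/* Amdahl's law, as stated above :
   time_after = time_unaffected + time_affected / improvement. */
static double time_after_improvement(double t_unaffected, double t_affected,
                                     double improvement)
{
    return t_unaffected + t_affected / improvement;
}

int main(void)
{
    /* Hypothetical example : 80 s of a 100 s run can be improved 10x. */
    double before = 100.0;
    double after  = time_after_improvement(20.0, 80.0, 10.0);
    printf("before: %.1f s, after: %.1f s, speedup: %.2f\n",
           before, after, before / after);
    return 0;
}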

Instructions

So, what's the point of all these notions of assembly languages, machine code, etc. ? They are all low-level constructs, primitive (comparisons with BASIC are often made), and restrictive (few instructions).

We're going to focus mainly on the MIPS assembly language, used on SGI, NEC, and some Nintendo machines, but mainly found in embedded systems nowadays. The design of its instruction set is straightforward : "keep it simple, stupid", favor regularity, for maximal performance, minimum cost, and low design time.

Arithmetic instructions syntax : [OPERATOR] [REG_OUT], [REG_IN1], [REG_IN2] (e.g. add, sub)

Data transfer instructions syntax : [COMMAND] [REG], [OFFSET]([REG_INDEX]) (e.g. lw, sw)

Machine code : MIPS instructions are all 32-bit wide. There are 32 registers, numbered from 0 to 31. Instructions fall into three main categories :

  1. R-type (register) instructions : opcode (6 bits), rs (5), rt (5), rd (5), shamt (5), funct (6),
  2. I-type (immediate) instructions : opcode (6 bits), rs (5), rt (5), 16-bit immediate,
  3. J-type (jump) instructions : opcode (6 bits), 26-bit address.

So information in machine language is stored as a sequence of bits. Yes, that means it is data. Programs are considered as data, like arrays, integers, and everything else inside the computer.

The MIPS CPU uses a "fetch and execute" cycle to process instructions.
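
As a minimal sketch of such a loop in C, in the spirit of assignment 2 (the opcode and funct values are real MIPS encodings, but the helper names, the memory model and the tiny instruction subset are mine ; beq and slt are described a bit further below) :

#include <stdint.h>

#define MEM_WORDS 1024
static uint32_t mem[MEM_WORDS];   /* instructions are data : they live in memory too */
static uint32_t reg[32];
static uint32_t pc;

/* Field extraction for the three MIPS instruction formats (R, I, J). */
static uint32_t op (uint32_t i) { return (i >> 26) & 0x3f; }
static uint32_t rs (uint32_t i) { return (i >> 21) & 0x1f; }
static uint32_t rt (uint32_t i) { return (i >> 16) & 0x1f; }
static uint32_t rd (uint32_t i) { return (i >> 11) & 0x1f; }
static uint32_t fun(uint32_t i) { return i & 0x3f; }
static int32_t  imm(uint32_t i) { return (int16_t)(i & 0xffff); } /* sign-extended */

void run(void)
{
    for (;;) {
        uint32_t ir = mem[pc / 4];       /* fetch */
        pc += 4;
        switch (op(ir)) {                /* decode + execute (tiny subset) */
        case 0x00:                       /* R-type : selected by the funct field */
            if (fun(ir) == 0x20)         /* add */
                reg[rd(ir)] = reg[rs(ir)] + reg[rt(ir)];
            else if (fun(ir) == 0x2a)    /* slt */
                reg[rd(ir)] = ((int32_t)reg[rs(ir)] < (int32_t)reg[rt(ir)]) ? 1 : 0;
            break;
        case 0x04:                       /* beq : PC-relative, offset counted in words */
            if (reg[rs(ir)] == reg[rt(ir)])
                pc += imm(ir) * 4;
            break;
        case 0x23:                       /* lw */
            reg[rt(ir)] = mem[(reg[rs(ir)] + imm(ir)) / 4];
            break;
        default:
            return;                      /* unknown opcode : stop the sketch */
        }
        reg[0] = 0;                      /* $zero is hard-wired to 0 */
    }
}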

Assembler programs, however, are not always sequential. Very often the programmer needs to control the flow of execution. Two instructions exist to enable this : BNE (Branch if Not Equal) and BEQ (Branch if EQual). Both take 2 registers and a 16-bit immediate label. The syntax is the following : [COND] [REG1], [REG2], [LABEL]

To create unconditional jumps, the "J" (for Jump) instruction exists. Its syntax is straightforward : J [LABEL].

It is also possible to check whether the content of a register is less than that of another. The instruction is called SLT, for "Set if Less Than", and works like this : SLT [R0], [R1], [R2] --> if (R1 < R2) R0 = 1 else R0 = 0;

It can then be combined with a BEQ.

More about jumps. Addresses on the MIPS CPU are 32-bit wide; however, the address field in jumps is 26-bit, and in branches 16-bit. The design assumes most branches are local, and branch offsets are relative to the Program Counter. Moreover, they are counted in words (4 bytes) and not in bytes, so the 26-bit jump field actually covers 28 bits of byte addresses (the remaining upper bits come from the PC). If a true 32-bit jump is *really* needed, a trick with jr $ra should be used. This instruction jumps to the address contained in a register (here $ra).
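
In C terms, the target computations look roughly like this (a simplified view that ignores branch delay slots ; function names are mine) :

#include <stdint.h>

/* Branch (beq/bne) : 16-bit signed offset, counted in words, relative to PC+4. */
static uint32_t branch_target(uint32_t pc, int16_t offset16)
{
    return pc + 4 + (int32_t)offset16 * 4;
}

/* Jump (j/jal) : 26-bit word address; the top 4 bits come from PC+4. */
static uint32_t jump_target(uint32_t pc, uint32_t target26)
{
    return ((pc + 4) & 0xf0000000u) | (target26 << 2);
}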

One more jump instruction exists : jal [immediate value], for Jump And Link. This will jump to the target and set $ra to the original PC + 4. This is useful for functions, for example :

...
jal toto     # call toto : jump, and set $ra to the address of the nop below
nop
toto: ...    # body of the function
jr $ra       # return : jump back to the address held in $ra

toto here is a function. On jr $ra the program will branch back to nop.

The 32 registers are not all general-purpose, and there are conventions one should follow : $zero (register 0) always contains zero ; $at (1) is reserved for the assembler ; $v0-$v1 (2-3) hold function results ; $a0-$a3 (4-7) hold function arguments ; $t0-$t9 are temporaries (not preserved across calls) ; $s0-$s7 are saved registers (preserved across calls) ; $k0-$k1 are reserved for the kernel ; $gp, $sp and $fp are the global, stack and frame pointers ; and $ra (register 31) holds the return address.

So, what about constants ? We saw that in some particular cases immediate values could be used, and it is true that constants appear quite frequently in programs (some say 50% of the operands are constants).

MIPS uses I-TYPE instructions for 16-bit constants : ADDI, SLTI, ANDI, ORI.

The syntax is the following : [OPER] [REGD], [REGS], [16-bit constant]

What about 32-bit constants ? You need to load them in two steps. First load the upper 16 bits with a LUI instruction (LUI [REGD], [16-bit constant]), which places them in the upper 16 bits of the register and clears the bottom 16 bits ; then do an ORI on the same register with the lower 16 bits of the constant (ORI [REGD], [REGD], [16-bit constant]).
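
In C terms, the split performed by the lui/ori pair can be sketched like this (the function name is mine) :

#include <stdint.h>

/* Splitting a 32-bit constant for a lui/ori pair, and what the two
   instructions compute on the destination register. */
uint32_t load_constant(uint32_t value)
{
    uint16_t upper = (uint16_t)(value >> 16);     /* goes into lui */
    uint16_t lower = (uint16_t)(value & 0xffff);  /* goes into ori */

    uint32_t reg = (uint32_t)upper << 16;  /* lui : upper half, lower half cleared */
    reg |= lower;                          /* ori : fill in the lower half         */
    return reg;                            /* reg == value                         */
}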

But never forget that even assembly language *is* a human representation and not machine code ! Even in assembler there are pseudo-instructions, macros, and convenient ways to write numbers... For performance measurements, one should count machine instructions.

A little about some alternative architectures. We have studied here a RISC CPU (Reduced Instruction Set Computer), but another design is the CISC architecture (Complex Instruction Set Computer), for example :

The PowerPC : more addressing modes (indexed addressing using 2 registers, update addressing to walk through arrays, load multiple / store multiple, a special counter register).

The infamous 80x86 : an architecture that has kept accumulating instructions and addressing modes since the 8086 (variable-length instructions, few registers, many special cases), which makes it hard to decode and to pipeline.

Please notice that modern CISC CPUs are in fact RISC cores emulating CISC instructions.

Virtual Machines

If you tell me you have never used a Super Nintendo emulator, I won't believe you.

A video game console emulator running on your personal computer is one of the applications falling into the virtual machine category.

So, basically, a virtual machine is a program that gives an interface to simulate some hardware.

For example, the ISA (Instruction Set Architecture) is a virtual machine, as it works as a boundary between machine code and hardware... In the same way, an operating system is usually a stack of virtual machine layers.

Virtual machines can fake hardware for applications. For example, on PCs without a floating-point unit, the Linux operating system provided a virtual machine that simulated that unit.

Virtual machines can supply another ISA to some programs - that's the principle of emulators.

We'll talk about the ABI (Application Binary Interface) level for virtual machines used to run applications of the same OS on different platforms (the FX!32 Windows NT 80x86 -> Alpha code translator is one of those).

The latter uses a technique called binary translation to speed things up. The principle is to translate each instruction to be emulated into target machine code before executing it.

Some virtual machines allow multiple OSes to run at the same time on the same platform, either by putting an abstraction layer below the OSes (VMware does that) or by being launched from a running OS (Wine).

High-level virtual machines, like the Java VM, can be totally platform-independent. It is interesting to note that the Java VM is also an ABI.

And last but not least, we talk about co-designed virtual machines when they transform one ISA into another automagically, not to emulate, but to improve performance. They are designed in cooperation with the hardware (hence their name). The Transmeta Crusoe CPU, IBM DAISY, or even the Pentium 4 are such virtual machines in a sense.

We can sum up all the different types of virtual machines with a nice diagram :

Virtual machines tree

Developing a virtual machine leads the programmer to a bunch of problems, among them :

Research on virtual machines has gone quite far. Nowadays, the best solutions in terms of performance are based on a software memory-management scheme known as a translation cache. The original code is compiled "on the fly" and stored in a cache by chunks, thus being directly reusable if a later pass occurs. Such a translation chaining process, using routines as the basic unit of translation and hash tables to fetch the code afterwards, has been implemented with success in FX!32, a software product allowing 80x86 applications to run efficiently on DEC Alpha hardware.
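
As a very rough sketch of the general idea (this is not FX!32's actual design ; all names and sizes here are invented), a translation cache boils down to a hash table keyed by the original code address :

#include <stdint.h>
#include <stddef.h>

/* Simplified sketch of a translation cache : translated chunks are kept in a
   hash table keyed by the original (guest) address, so that a second pass over
   the same code reuses the translation instead of redoing it.
   Names and structure are illustrative. */

typedef void (*translated_fn)(void);

struct tc_entry {
    uint32_t      guest_pc;   /* start address of the original routine */
    translated_fn host_code;  /* pointer to the generated host code    */
    struct tc_entry *next;
};

#define TC_BUCKETS 1024
static struct tc_entry *tc[TC_BUCKETS];

static translated_fn tc_lookup(uint32_t guest_pc)
{
    for (struct tc_entry *e = tc[guest_pc % TC_BUCKETS]; e; e = e->next)
        if (e->guest_pc == guest_pc)
            return e->host_code;
    return NULL;  /* miss : the caller must translate the chunk and insert it */
}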

Basics of Digital Electronics

Digital Logic Circuits (DLCs) can be of three different kinds :

Basics of boolean algebra : let "+" mean OR, "." mean AND, and "!" mean NOT.

Some components to know : the multiplexer (2^X inputs, 1 output, X control lines ; the output takes the value of one of the inputs, selected by the control lines) ; and the demultiplexer (the opposite operation).

A word about speed and performance : speed is a function of the propagation delay of the electronic signal through the gates and interconnections. We talk about fan-in for the number of inputs of a gate ; a large fan-in increases the depth of a circuit.

For sequential logic circuits, two main models coexist and are equivalent : the Mealy model, in which the outputs of a circuit are a function of the inputs and of the current state ; and the Moore model, in which the outputs are a function of the current state only (the inputs only influence the next state).

Arithmetic

So, we have talked about the CPU. But how is the real thing implemented ? Here are some steps towards a solution.

The first thing to take into account is that binary code means nothing by itself. When looking at a binary file, you cannot tell whether something is code, data, etc. Conventions are needed to define a relation between bits and numbers, instructions, and so on.

First, numbers are finite. Integers are 32-bit on a MIPS CPU. That means overflow problems can occur (when the result of an operation is bigger than what the number size can hold). And we are not even talking about fractions and real numbers yet.

How about negative numbers ? Well, the most widely used representation of such numbers is two's complement : invert the bits of the number, and add 1.

For example : 3 (0011) --> 1100 (inversion) +1 = 1101 (-3).

For a given integer, extending its size means copying its most significant bit to the left : 1101 --> 11111101 (still -3).
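
A small C sketch of both operations (function names are mine) :

#include <stdint.h>

/* Two's complement negation : invert the bits, add 1. */
int32_t negate(int32_t x)
{
    return (int32_t)(~(uint32_t)x + 1u);   /* == -x */
}

/* Sign extension from 16 to 32 bits : copy the most significant bit leftwards. */
int32_t sign_extend16(uint16_t x)
{
    return (x & 0x8000) ? (int32_t)(x | 0xffff0000u) : (int32_t)x;
}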

How to detect overflow ?

Overflow can *not* occur when :

  1. adding two operands of different signs,
  2. subtracting two operands of the same sign.

Overflow occurs when :

  1. adding two positive numbers yields a negative result,
  2. adding two negative numbers yields a positive result,
  3. subtracting a negative from a positive yields a negative result, or a positive from a negative yields a positive result.

When this happens on a MIPS CPU, what we call an exception occurs. The CPU control jumps to a predefined address that contains an exception routine. It is up to the user to handle it.
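
As a sketch of how an emulator (like the one of assignment 2) might detect signed overflow on an addition, following the rules above (the function name is mine) :

#include <stdint.h>

/* Signed overflow detection for an addition : overflow is only possible when
   both operands have the same sign, and it happened when the result's sign
   differs from theirs. */
int add_overflows(int32_t a, int32_t b)
{
    int32_t sum = (int32_t)((uint32_t)a + (uint32_t)b);  /* wraparound add */
    return ((a >= 0) == (b >= 0)) && ((sum >= 0) != (a >= 0));
}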

But let's go deeper into the Arithmetic Logic Unit (ALU). Its easiest operations to implement are OR and AND, as each bit of the operands does not affect any other bit.

The ADDition can be built with 32 1-bit adders. However, each adder computes a carry that needs to be sent to the next one --> this is the ripple-carry problem (the carry must propagate through all 32 stages).

The SUBtraction is easily implemented by taking the two's complement of the second operand and using the addition.

The SLT instruction is computed using a subtraction : the sign bit of the result gives the answer. The equality test (needed by BEQ) is a big NOR gate on the subtraction result, which returns "1" when that result is 0.

In an ALU, the key idea is to use multiplexors to select the output of the desired operation. In fact, all gates are always working, and only one result is useful for a given computation.

It is possible to optimize, for example the addition, using what we call a carry-lookahead adder. The principle, for A and B the two operands, is : a carry is generated when G(i) = A(i).B(i), and propagated when P(i) = A(i)+B(i). This roughly cuts the addition time in half.
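
The generate/propagate equations can be spelled out in C as follows - keeping in mind that the whole point of the hardware version is to compute all the carries in parallel, which this sequential sketch obviously does not do (names and the 4-bit width are illustrative) :

#include <stdint.h>

/* Generate/propagate signals for a 4-bit carry-lookahead slice.
   g(i) = a(i) AND b(i) : this position generates a carry by itself.
   p(i) = a(i) OR  b(i) : this position propagates an incoming carry.
   In hardware all four carries are derived in parallel from g, p and c0;
   this C version just spells the equations out. */
static uint8_t cla_carries(uint8_t a, uint8_t b, uint8_t c0)
{
    uint8_t g = a & b;
    uint8_t p = a | b;
    uint8_t c = c0 & 1;
    uint8_t carries = 0;

    for (int i = 0; i < 4; i++) {
        c = ((g >> i) & 1) | (((p >> i) & 1) & c);  /* c(i+1) = g(i) + p(i).c(i) */
        carries |= (uint8_t)(c << i);
    }
    return carries;  /* bit i holds the carry out of position i */
}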

Multiplication is another hard matter. The basic principle, as we learnt in primary school, is shifting and adding. Shifting means we need a wider register to store the temporary results.

Three versions exist; we will only consider the last one : we use a 32-bit ALU to add and a 64-bit shift register that initially contains the multiplier. The multiplicand is added to the most significant 32 bits of the shift register if the LSB of the latter is equal to one ; in either case the shift register is then shifted right. The process ends after 32 shifts.
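
A C sketch of that third version (the shift is done before the add here so that everything fits in 64 bits ; in hardware it is the carry-out of the 32-bit adder that gets shifted in) :

#include <stdint.h>

/* Shift-and-add multiplication, in the spirit of the version described above :
   a 64-bit product register initially holds the multiplier in its low half;
   at each of the 32 steps the register is shifted right and, if the bit that
   was in the LSB was 1, the multiplicand is added into the upper half. */
uint64_t multiply(uint32_t multiplicand, uint32_t multiplier)
{
    uint64_t product = multiplier;

    for (int step = 0; step < 32; step++) {
        int lsb = (int)(product & 1);
        product >>= 1;
        if (lsb)
            product += (uint64_t)multiplicand << 31;  /* add into the upper half */
    }
    return product;  /* == (uint64_t)multiplicand * multiplier */
}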

Now for something everyone waited for : the floating-point unit.

Floating point numbers are a convenient way to represent numbers with fractions, very small numbers, and very large numbers.

A floating point number is a bitfield, containing a sign bit, an exponent, and the significand (mantissa). We have :

N=(-1)^sign * significand * 2^exponent

Increase the significand size, and you will increase accuracy. Increase the exponent size, and you will increase the range.

The IEEE 754 standard defines FP numbers as follows :

  1. single precision (32 bits) : 1 sign bit, 8 exponent bits, 23 significand bits,
  2. double precision (64 bits) : 1 sign bit, 11 exponent bits, 52 significand bits.

The leading "1" of the significand is implicit, except for the number 0 - all the fields are to 0.

The exponent is biased to make some routines easier : bias of 127 for single-precision and of 1023 for double-precision.

We then have :

N=(-1)^sign * (1+significand) * 2^(exponent - bias)

In certain situations, please note that accuracy is a big problem. IEEE 754 provides 4 rounding modes : truncate (toward zero), round to nearest, round toward +infinity (ceiling), and round toward -infinity (floor). Accurate rounding is made possible by two extra bits kept by the hardware during intermediate computations : the guard and round bits.

A division by zero will lead to "infinity" or "negative infinity", and dividing zero by itself will give "NaN". All of these special values are encoded as reserved bit patterns.
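
A small C sketch that decodes a single-precision number by hand, following the fields and the bias described above (denormals, infinities and NaNs are left out) :

#include <stdint.h>
#include <string.h>
#include <math.h>
#include <stdio.h>

/* Decoding an IEEE 754 single-precision number :
   1 sign bit, 8 exponent bits (bias 127), 23 significand bits with an
   implicit leading 1 (except for 0 and the other reserved patterns). */
static double decode_float(uint32_t bits)
{
    uint32_t sign     = bits >> 31;
    uint32_t exponent = (bits >> 23) & 0xff;
    uint32_t frac     = bits & 0x7fffff;

    if (exponent == 0 && frac == 0)             /* +0 or -0 */
        return sign ? -0.0 : 0.0;
    /* (denormals, infinities and NaNs are left out of this sketch) */

    double significand = 1.0 + (double)frac / (double)(1 << 23);
    double value = ldexp(significand, (int)exponent - 127);  /* * 2^(e - bias) */
    return sign ? -value : value;
}

int main(void)
{
    float f = -0.75f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    printf("%f decodes back to %f\n", f, decode_float(bits));
    return 0;
}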

CPU : Datapath and Control

Implementing the MIPS CPU can be rather simple. We will work on a reduced subset of instructions (lw, sw, add, sub, and, or, slt, beq, j).

Executing an instruction is always performed the same way : fetch the instruction, decode it and read the registers it uses, execute it (or compute an address), access memory if needed, and write the result back (these steps are detailed in the multi-cycle discussion below).

A CPU is basically made of two kinds of functional units : elements that operate on values (combinatorial logic), and elements that hold a state.

Our implementation will use the edge-triggered methodology - an action is started when the clock signal switches from 0 to 1. The register file can be built using D flip-flops, and we use multiplexors to stitch components together.

The control signals are part of the control path.

The control selects the operation to perform and directs the flow of data in the CPU, using information from the instruction. It is made of simple combinatorial logic.

In a single-cycle approach, every instruction is executed in one CPU cycle, so the cycle time is set by the longest instruction. It requires several ALUs/adders, thus wasting CPU area, and is not efficient when complex instructions are in play.

The multi-cycle approach solves the issue by using a smaller cycle time and cutting the instructions in chunks of 1 cycle.

The principle is to reuse the same functional units as often as possible ; the control signals are now not determined by the instruction alone. The control logic becomes a finite-state machine, with a set of states, a next-state function, and an output function (a small sketch follows the list of cycles below).

The cycles are the following :

  1. Fetch instruction : use the PC to get the instruction, increment the PC by 4, and put the result into the IR (Instruction Register),
  2. Decode instruction and fetch registers : read the register contents from the register unit (the register "file"), and compute the branch address from the instruction,
  3. Execute : if we have an arithmetic instruction, compute the result ; if we have a jump or a branch, modify the PC ; if we have a memory access, compute the memory address,
  4. Complete instruction : if we have an arithmetic instruction, write the result to the destination register ; if we have a memory reference, either set the memory location to the register value (store) or set the MDR (Memory Data Register) to the value at the memory location (load),
  5. Write-back registers : if we have a load, set the register to the MDR.
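
As a rough sketch, the control seen as a finite-state machine can be reduced to a next-state function over those five steps (the states and flags are simplified to the extreme ; this is the shape of the thing, not a real MIPS controller) :

/* The multicycle control as a finite-state machine : the current state and
   the decoded instruction determine the signals to assert (omitted here) and
   the next state. */
enum state { FETCH, DECODE, EXECUTE, COMPLETE, WRITEBACK };

/* Follows the five steps listed above : only loads need the fifth one. */
enum state next_state(enum state s, int is_branch_or_jump, int is_load)
{
    switch (s) {
    case FETCH:     return DECODE;
    case DECODE:    return EXECUTE;
    case EXECUTE:   return is_branch_or_jump ? FETCH : COMPLETE;  /* PC already updated */
    case COMPLETE:  return is_load ? WRITEBACK : FETCH;           /* only loads go on   */
    case WRITEBACK: return FETCH;
    }
    return FETCH;
}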

Implementing the control

The control signals depend upon the instruction being executed and the step being performed. We can use this information to specify a finite state machine, either with a graph or with microprogramming.

The implementation is then derived from specification.

You can use either a ROM or a PLA to implement the program. PLA is most widely used ;).

A word about microprogramming : microinstructions describe how the CPU should behave for a given instruction. It is a specification methodology especially appropriate when the CPU has many instructions ; however, it seems a little out of fashion right now. Note that two implementations of the same architecture can have very different microcode !

For each label, and for each sequence (phase of the CPU), we define which control signals have to be asserted.

Tradeoffs for microcode :

Pipelining

The goal of pipelining is to improve performance by increasing instruction throughput. It is easier to do if instructions are the same length, if there are not too many of them, and if the memory is accessed only by load-store instructions.

There are several problems with pipelining, called hazards : structural hazards (two instructions needing the same hardware at the same time), data hazards, and control hazards (both detailed below).

The basic idea is to split the datapath into stages, storing each stage's temporary data in pipeline registers.

Our MIPS CPU will use a 5-stage pipeline implementation :

  1. Instruction fetch and PC increase,
  2. Instruction decode and register fetch,
  3. Execution,
  4. Memory access stage,
  5. Register writeback.

The pipeline control signals travel along with the data in the pipeline registers.

This way, while one instruction is executed, another is decoded, a third is fetched, and so on... At any given time, there are up to 5 instructions in the CPU.

Data hazards

sub $2, $1, $3   # writes $2
add $2, $2, $3   # reads $2 right away --> data hazard

The previous instructions create a data hazard - the second uses a register modified by the first.

Such hazards can be prevented in software by inserting "nops", or solved by the CPU itself using what we call forwarding : temporary results are passed to the next instruction before they are written back to the register file. To handle this we add a forwarding unit to the CPU.

However, it is not always possible to forward the result. For example, a load word can cause a hazard because its value is not "known" before the 4th stage (memory access). A "load delay slot" is used to solve the issue. A hazard detection unit is added to the CPU ; it stalls the following instruction by keeping it in the same stage - we call that inserting a bubble.
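
The decision made by the forwarding unit can be sketched in C as follows (the EX/MEM and MEM/WB names follow the usual textbook convention ; the structure and function name are illustrative) :

#include <stdbool.h>
#include <stdint.h>

/* Sketch of the forwarding unit's decision : if the instruction currently in
   the EX/MEM (or MEM/WB) pipeline register is going to write a register that
   the instruction now entering EXECUTE wants to read, take the value from the
   pipeline register instead of the (stale) register file. */
struct pipe_reg { bool reg_write; uint8_t write_reg; };

/* Returns 2 to forward from EX/MEM, 1 from MEM/WB, 0 for no forwarding. */
int forward_source(uint8_t source_reg,
                   struct pipe_reg ex_mem, struct pipe_reg mem_wb)
{
    if (ex_mem.reg_write && ex_mem.write_reg != 0 &&
        ex_mem.write_reg == source_reg)
        return 2;                       /* the most recent result wins */
    if (mem_wb.reg_write && mem_wb.write_reg != 0 &&
        mem_wb.write_reg == source_reg)
        return 1;
    return 0;
}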

Control hazards

When we branch, other instructions have already been loaded into the pipeline, the CPU making the default assumption that the branch will *not* be taken. Some hardware needs to be added to flush those instructions if the assumption proves wrong.

The hazard detection unit can handle this too.

Improving performance

The performance of the CPU can be increased by avoiding stalls, and for branches by adding a branch delay slot (the instruction following the branch is always executed). Still for branches, we can create a "branch if equal likely" instruction, which assumes the branch will be taken.

Superscalar processors are able to start more than one instruction in the same cycle, reading a chunk of instructions from memory and deciding how many of them can be started.

Very Long Instruction Word (VLIW) CPUs do the same by loading a fixed-size "packet" of instructions per clock cycle.

Dynamic scheduling is a complex technique in which the hardware tries to "guess" which instructions it should execute next. Out-of-order execution must be possible on those machines. The hardware can also predict dynamically whether branches are likely to be taken, for example by using a branch history table.

Superpipelining means creating a LOT of pipeline stages in the CPU. We consider a processor to be superpipelined if it has more than 6 pipeline stages.

Optimizing an assembly language program for such CPUs is an NP-complete scheduling problem, very hard to manage for humans...

Memory

We're now going to talk about memory management. Memory can be separated into two main classes : ROMs (Read-Only Memories) and RAMs (Random Access Memories). The former do not need any power supply to retain data, and are quite slow ; the latter can be read/written at any time, but "lose" the data they contain in case of power loss.

RAMs can themselves be separated into two classes : SRAMs, in which values are stored in pairs of inverting gates (fast memories, taking a lot of space) ; and DRAMs, in which the data is stored as charge in capacitors. The latter are much slower than SRAMs, but denser and cheaper.

Memory hierarchy

It is not possible, even today, to have a very large amount of fast memory. Instead, we usually have a small amount of very fast memory (close to the CPU speed, usually SRAM on the same chip), a larger amount of slower memory (generally the much cheaper DRAM, on separate chips), and a tremendously big amount of very slow storage (I mean disks).

This hierarchy can be viewed as a kind of pyramid. The most frequently used data should be very close to the CPU, and the least used on disk, based on the principle of locality : a referenced item will tend to be referenced again soon (temporal locality), and nearby items are likely to be referenced soon as well (spatial locality).

We shall talk of a block to describe the minimum quantity of data that can be transferred (the unit stored in the memory level closest to the CPU). A hit occurs when requested data is found in that memory, and a miss occurs otherwise.

The cache principle

All modern computers use something called a "cache" to organize this holding space for memory. It is a component able to "know" whether it contains a requested data element, and if so, to find it.

The cache can be direct-mapped : many blocks at the lower memory level share the same location in the upper-level cache, each block being placed at the index "block address modulo the number of cache blocks". This approach exploits temporal locality.
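
A small C sketch of a direct-mapped lookup (the block and cache sizes are arbitrary here) :

#include <stdint.h>
#include <stdbool.h>

/* Direct-mapped cache lookup : the block address modulo the number of cache
   blocks gives the only possible location; the stored tag tells us whether
   what is there is really the requested block. Sizes are illustrative. */
#define BLOCK_SIZE  16     /* bytes per block     */
#define NUM_BLOCKS  256    /* blocks in the cache */

struct cache_line { bool valid; uint32_t tag; uint8_t data[BLOCK_SIZE]; };
static struct cache_line cache[NUM_BLOCKS];

bool cache_hit(uint32_t address)
{
    uint32_t block_addr = address / BLOCK_SIZE;
    uint32_t index      = block_addr % NUM_BLOCKS;   /* where it must be      */
    uint32_t tag        = block_addr / NUM_BLOCKS;   /* who is actually there */

    return cache[index].valid && cache[index].tag == tag;
}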

When a miss occurs, the CPU is stalled, and the data is fetched back from memory and put into the cache.

Data is also written through the cache. When the CPU performs a memory write, the copy in the cache is written. If the corresponding main memory is written at the same time, we talk about a write-through strategy ; otherwise it is write-back. In case of a write miss (some other memory block currently occupies the position we want to write to), the entire block is first read into the cache, then the write is performed.

Hardware issues

In some hardware implementations, memory banks can be used : an interleaved memory organization in which consecutive addresses are spread over several banks, so that accesses to different banks can overlap.

Referencing problems : increasing the block size tends to lower the miss rate (better use of spatial locality). As code gets bigger, using split caches (separate instruction and data caches) can also help.

Another way to decrease the miss ratio is to use associativity. For example, a fully associative cache permits a memory block to be placed in any cache block (not just one)... A 2-way set-associative cache permits a memory block to go into either of 2 cache locations instead of one.

The higher the associativity, the better the hit ratio. The replacement strategy in an associative cache is generally least-recently used (LRU).

To decrease the miss penalty (stall time in case of a miss), a multilevel cache can be used : we can add a second-level cache behind the CPU's first-level cache, for example. The goal is then to optimize the hit time of the first-level cache, and the miss rate of the second level.

Virtual memory gives the user the illusion of having more physical memory than there really is. It can also be used for program relocation and protection. The operating system deals with page faults (accesses to pages that are not currently in physical memory), for example by retrieving the data from disk - which carries a BIG miss penalty.
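
A sketch of the address translation involved, assuming 4 KiB pages and a flat page table (both are simplifying assumptions for illustration) :

#include <stdint.h>
#include <stdbool.h>

/* Virtual-to-physical translation with 4 KiB pages. A flat page table is used
   here for simplicity; a miss is a page fault, handled by the operating system. */
#define PAGE_SHIFT 12
#define NUM_PAGES  (1u << 20)           /* 32-bit addresses, 4 KiB pages */

struct pte { bool present; uint32_t frame; };
static struct pte page_table[NUM_PAGES];

bool translate(uint32_t vaddr, uint32_t *paddr)
{
    uint32_t vpn    = vaddr >> PAGE_SHIFT;              /* virtual page number */
    uint32_t offset = vaddr & ((1u << PAGE_SHIFT) - 1);

    if (!page_table[vpn].present)
        return false;                                   /* page fault : ask the OS */
    *paddr = (page_table[vpn].frame << PAGE_SHIFT) | offset;
    return true;
}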

Input/Output

Input/Output (IO) is affected by many factors (expandability, resilience, and so on).

IO devices are very diverse.

Disks use two main technologies : IDE-ATA and SCSI (Small Computer System Interface). The former is known for poor simultaneous transfers ; the latter is plug and play, has a high transfer rate, and doesn't need the CPU to work, but is much more expensive.

Buses are shared communication links, using one or more wires. Designing a bus can be hard : buses are generally the bottleneck of computer systems ; moreover, you have to think about length, number of attached devices, cost...

Buses can be divided into three main types : processor-memory buses (short, fast, specific), I/O buses (longer, supporting many device types), and backplane buses (a compromise allowing processor, memory, and I/O devices to share a single bus).

Synchronous buses use a clock to synchronise the devices, while asynchronous buses use handshaking.

The bus arbitration algorithm (deciding which device gets the "right to use" the bus) is also a problem.

The operating system has several ways of waiting for new data : polling (active waiting - a tremendous waste of time, but sometimes used for embedded devices) ; interrupts, a signalling mechanism that can "break" the current CPU state to make it execute a programmer-defined sequence of instructions ; and DMA (Direct Memory Access), which enables a device to communicate over the bus continuously without involving the CPU.
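
As a sketch, polling a memory-mapped device in C looks like this (the addresses and the READY bit are hypothetical, just to show the shape of the busy loop) :

#include <stdint.h>

/* Polling (active wait) on a memory-mapped status register.
   The addresses and the READY bit below are hypothetical. */
#define DEVICE_STATUS ((volatile uint32_t *)0x10000000u)  /* hypothetical */
#define DEVICE_DATA   ((volatile uint32_t *)0x10000004u)  /* hypothetical */
#define READY_BIT     0x1u

uint32_t read_when_ready(void)
{
    while ((*DEVICE_STATUS & READY_BIT) == 0)
        ;                       /* busy-wait : the CPU does nothing useful here */
    return *DEVICE_DATA;
}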

Vector machines

The general idea of a vector machine is to replace each register by a whole register file. That means, for example, having 32 vector registers of 32 values each... Everyone has heard about Cray :).

They facilitate scientific computing, for instance linear algebra, and so on.

Most modern computers include some vector-machine principles, like MMX on Intel PCs and AltiVec on Macintoshes.
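
The typical candidate for such vector units is a loop applying the same operation to whole arrays, for example this classic one in C :

/* The kind of loop vector units are made for : the same operation applied to
   whole arrays. A vector machine (or MMX/AltiVec-style extensions) executes
   this element-wise work as wide vector operations instead of one scalar
   operation per iteration. */
void saxpy(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}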

The third generation of vector machines is highly configurable, even at a low level. Many optimizations are included, such as pipelining the floating-point operations by decomposing an FP addition into 4 steps :

  1. Compute and compare exponents of operands,
  2. Shift mantissa of the operand having the lowest exponent,
  3. Perform addition,
  4. Normalize the result.

Multiprocessing and parallel computing

Between a single machine having multiple processors and multiple machines connected together to perform a computation, the difference can sometimes be quite thin.

The goal of multiprocessing is to solve larger problems in a shorter amount of time.

Clustering, which consists of taking a bunch of computers and linking them together, is much cheaper than developing a specialized machine. However, algorithms have to be redesigned to take advantage of parallel processing.

Flynn's classification :

                                      Number of data streams
                                      1                              > 1
Number of instruction streams    1    SISD (Von Neumann machines)    SIMD (vector machines)
                               > 1    MISD                           MIMD (clusters)

I stands for Instruction, D for Data, M for Multiple, S for Single.

MIMD can be divided into two main classes : clusters, which use message-passing over a distributed network, and shared memory machines, where several CPUs share the same data.

Message passing vs. shared memory :

How do they share data ? With message passing, each CPU has a private memory bank which is not shared. With shared memory, the memory is visible from all CPUs ; NUMA machines have non-uniform memory access times (some CPUs are delayed when accessing memory), while UMA (SMP) machines give all CPUs the same memory access time.

How do they cooperate and synchronize ? With message passing, messages are exchanged, for example over a network. With shared memory, the CPUs are synchronized with locks, set up by the compiler or the user program ; a single bus connects them.

How many CPUs can be used ? With message passing, many, many... With shared memory, the number of CPUs is limited by the bus bandwidth.

The cache coherency problem.

Each CPU has its own cache ; however, UMA and NUMA machines share the same data. The cache coherency problem occurs when a CPU writes to memory while the same memory location is cached by another CPU. The snooping solution involves either invalidating the corresponding cache entries in the other CPUs, or updating all the CPUs with the new value, which can consume a lot of time.

Performance optimization :

Speedup(n) = t(1 cpu) / t(n cpus) ; it is at most n in the ideal case.

Principle of load balance : each CPU should do roughly the same amount of work.

In a distributed network, extra work has to be done : communication, data exchange, sometimes redundant computations, and so on... Much depends on the topology of the cluster.