CIS5220 Computer Organization Notes

1. Introduction

Textbook: The Architecture of Computer Hardware and Systems Software by Irv Englander.

2. Number Systems

Numbers are usually written in base 10. The digits of a number written in base B have weights:

  ... B³ B² B¹ B⁰ . B^-1 B^-2 B^-3 ...

For example, in base 8:

  123.45₈ = 1x8² + 2x8¹ + 3x8⁰ + 4x8^-1 + 5x8^-2
  = 1 x 64 + 2 x 8 + 3 x 1 + 4 x 1/8 + 5 x 1/64
  = 64 + 16 + 3 + 4/8 + 5/64
  = 83 37/64.

Important bases in computers are base 2 (binary) and base 16 (hexadecimal or hex). In hex, A=10, B=11, C=12, D=13, E=14, F=15. Examples:

  1AF₁₆  (sometimes written 0x1AF or 1AFh) 
  = 1 x 16² + 10 x 16¹ + 15 x 16⁰
  = 1 x 256 + 10 x 16 + 15 
  = 431.

  110101111₂
  = 1 x 2⁸ + 1 x 2⁷ + 0 x 2⁶ + 1 x 2⁵ + 0 x 2⁴ + 1 x 2³ + 1 x 2² + 1 x 2¹ + 1 x 2⁰
  = 256 + 128 + 32 + 8 + 4 + 2 + 1
  = 431

Each hex digit converts to 4 binary digits (4 bits).

  0001 1010 1111₂ = 1AF₁₆ = 431.

3. Data Formats

All data in computers are stored as sequences of bits. Important units:

  8 bits = 1 byte.
  1024 bytes = 1 kilobyte = 1 KB = 2¹⁰ bytes.
  1024 KB = 1 megabyte = 1 MB = 2²⁰ bytes = 1,048,576 bytes.
  1024 MB = 1 gigabyte 1 GB = 2³⁰ bytes.
  1024 GB = 1 terabyte 1 TB = 2⁴⁰ bytes.

File, disk, and memory sizes are expressed in bytes. However, network and modem speeds are often expressed in bits per second with a lowercase "b", e.g. 56 Kb/s = 56 kilobits per second = 7 KB per second.

Text

Plain text uses the ASCII character set. Each byte (as a binary number) represents one character, coded as follows:

  10 = linefeed (end of line)
  13 = carriage return (preceeds linefeed in Windows, but not UNIX)
  32 = space
  48..57 = '0'..'9'
  65..90 = 'A'..'Z'
  97..122 = 'a'..'z'
  Complete ASCII table (0 to 127)
  Multi-byte Unicode extensions for other languages.

Other data formats

Page display: pdf, ps,...
Images: jpeg, gif, png,...
Audio: wav, mp3,...
Video: mpeg,...

4. Integers

Integer types have fixed sizes, may be signed or unsigned. Signed numbers are binary 2's compliment: weight of first bit is negative. Example: 4 bit 2's compliment:

  1101₂
  = 1 x -2³ + 1 x 2² + 0 x 2¹ + 1 x 2⁰
  = -8 + 4 + 0 + 1
  = -3

Common data types.

  Size      C/C++              Java             Range
  8 bits    char               byte         -128 to 127         
  8 bits    unsigned char      -               0 to 255
  16 bits   short              short      -32768 to 32767
  16 bits   unsigned short     -               0 to 65535
  32 bits   int                int   -2147483648 to 2147383647
  32 bits   unsigned int       -               0 to 4294967295
  64 bits   long long          long    -9.223e18 to 9.223e18
  64 bits   unsigned long long -               0 to 1.844e19

Arithmetic producing a result outside the legal range "overflows" by dropping the leading bits. For n bit types the computed result is congruent modulo 2ⁿ to the true result, e.g. for 32 bit integers, add or subtract 2³² = 4294967296 until the result is within range:

  1000000000 + 2000000000 == -1294967296  (3000000000 - 2³²)
  0x20000 * 0x30000 == 0x0  (last 8 hex digits of 0x600000000)

Integer division rounds toward zero.

   8/3 == 2
  -8/3 == -2

5. Real Numbers

IEEE-754 real numbers are approximated by 3 binary numbers: sign x mantissa x 2^exponent. Sign is stored as 0 for positive or 1 for negative. n-bit exponent is stored as an unsigned number after adding 2^n-1-1 (excess 127 or 1023 format). Mantissa is in range 1.000...₂ to 1.111...₂ but only the bits after the point are stored. As a special case, 0.0 is stored as all 0 bits.

  Size     C/C++/Java type  Sign   Exponent  Mantissa  Range     Precision
  32 bits  float            1 bit  8 bits    23 bits   +-10³⁸     6 significant digits
  64 bits  double           1 bit  11 bits   52 bits   +-10³⁰⁸    14 significant digits

  Decimal  float                                double        (52 bits...)
  0.0      0 00000000 00000000000000000000000   0 00000000000 00000000...0
  1.0      0 01111111 00000000000000000000000   0 01111111111 00000000...0
  2.0      0 10000000 00000000000000000000000   0 10000000000 00000000...0
  3.0      0 10000000 10000000000000000000000   0 10000000000 10000000...0
  4.0      0 10000001 00000000000000000000000   0 10000000001 00000000...0
  5.0      0 10000001 01000000000000000000000   0 10000000001 01000000...0
 -5.0      1 10000001 01000000000000000000000   1 10000000001 01000000...0
  0.5      0 01111110 00000000000000000000000   0 01111111110 00000000...0
  0.25     0 01111101 00000000000000000000000   0 01111111101 00000000...0
  0.1      0 01111011 10011001100110011001101   0 01111111011 10011001...1

Floating point arithmetic is not always exact.

  10 * 0.1 - 1 != 0           (0.1 does not have an exact representation)
  sqrt(2) * sqrt(2) - 2 != 0  (sqrt(2) does not have an exact representation)

6. Little Man Computer

Flash animation (instruction set differs from book).

Calculator (ALU) - range -999 to 999, can enter numbers, add, subtract.
Memory - 100 storage locations, each can hold a 3 digit number 000 to 999.
Input - a 3 digit number from user.
Output - a 3 digit number to the user.
Program counter - a 2 digit number of the next instruction to fetch from memory. User can reset to 0 at any time.
Little Man (control unit) - fetches instructions from the memory location indicated by the program counter and interprets the contents as follows:

    000 = halt (until user presses reset)
    1xx = add number from location xx to calculator
    2xx = subtract number from location xx
    3xx = store calculator contents at location xx
    5xx = load contents of location xx into calculator
    6xx = enter xx into program counter (branch unconditional)
    7xx = enter xx into program counter if calculator holds 0 (branch if 0)
    8xx = enter xx into program counter if calculator holds 0 or a positive number    
    901 = read input, put into calculator
    902 = output value in calculator

After all instructions except branches (6xx, 7xx, 8xx) and halt (000) the little man adds 1 to the program counter. Example program to input 2 numbers, add them, output the sum:

  00 901  (input)
  01 399  (store at 99)
  02 901  (input)
  03 199  (add from 99)
  04 902  (output)
  05 000  (halt)

7. Computer Components

  Registers (in CPU)                Typical number  Purpose
    Integer                                8-32     Arithmetic, comparison, logical
    Floating point                         8-16     Arithmetic
    Flags                                  4-16     Comparison result
    Stack pointer                          1        Call, return
    Instruction pointer                    1        Fetch
    Segment                                4-8      Process protection

  RAM     Random Access Memory      Typical size in bytes
    DRAM    Dynamic RAM                    10⁸       Main memory
    SRAM    Static RAM                     10⁶       Cache
  ROM     Read-Only Memory                 10⁵       Boot code
    PROM    Programmable (one time)
    EPROM   Erasable PROM (in UV light)
    EEPROM (flash) electrically EPROM      10³       Infrequently written data
  Disk                                     10¹¹      Files

CPU talks to RAM, ROM, disk, and I/O on a bus:

Address lines (~32) - CPU selects memory location or I/O device.
Data lines (~32) - From CPU to memory (store) or I/O (output) or from memory to CPU (read) or I/O to CPU (input).
R/W (1) - From CPU to memory or I/O to control direction of data lines.
Interrupt (a few) - From I/O to tell CPU that I/O is completed.

Direct Memory Access (DMA) - Disk reads or writes memory (bypassing CPU) and interrupts CPU when done.

Instruction Set

  Move           =
  Arithmetic     + - * /
  Comparison     == != < > <= >=
  Logical        & | ^ ~  (and, or, xor, not)
  Shift          <<  >>  (shift left, right), rotate
  Branch         goto, if (...) goto
  Push, Pop      put on stack, remove
  Call, Return   push/pop instruction pointer
  Vector         vector arithmetic (MMX, SSE on x86)

Examples (x86)

Moves are to or from registers, not memory to memory.

  ; a = b
  mov eax, [b]  ; move from variable b to register eax
  mov [a], eax  ; move from register eax to variable a

There is no if, while, for, just compare and conditional branch.

  ; if (a == b) a = a + 1;
  mov eax, [a]
  cmp eax, [b]  ; compare b with a, save result in flags
  jne L1        ; jump if not equal
  add eax, 1    ; add 1 to copy of a
  mov [a], eax  ; save result
L1:             ; label to jump to

Logical and shift operators are faster than multiplication and division, but only work for powers of 2.

  x << 3;  // x * 8 (double x 3 times)
  x >> 3;  // x / 8 (halve x 3 times, rounding down)
  x & 7    // x % 8 (AND x with 00000111₂)

The stack is used both for passing parameters to functions and saving the return address between call and return. On the x86 the stack pointer is called esp. The stack grows downward by 4 bytes for each push.

  ; x = sum(1, 2);
  push 2        ; by convention, push right to left
  push 1
  call sum      ; push instruction pointer and jump to sum:
  mov [x], eax  ; result is returned in eax
  add esp, 8    ; the caller discards the arguments

  |       |  
  +-------+ Picture of stack in sum()
  | b = 2 |
  +-------+ esp+8
  | a = 1 |
  +-------+ esp+4
  | eip   |   <-- address of the instruction following "call"
  +-------+ <--- esp

  ; int sum(int a, int b) {return a+b;}
sum:
  mov eax, [esp+4]  ; a is 4 bytes above where esp points
  add eax, [esp+8]  ; b
  ret               ; pop instruction pointer (eip)

The Pentium MMX (1997) introduced 8 64-bit registers that could operate on packed 8, 16, or 32 bit integers 64 bits at a time. The Pentium 4 (2000) added 8 128-bit registers that could operate on 4 floats or 2 doubles at a time.

  ; // add two 4-element vectors of 16-bit integers using MMX
  ; short a[4], b[4];
  ; for (i=0; i<4; ++i)
  ;   a[i] = a[i] + b[i]

  movq mm0, [a]   ; load a[0..3] into MMX register mm0
  movq mm1, [b]   ; load b[0..3]
  paddw mm0, mm1  ; do 4 additions at once, result in mm0
  movq [a], mm0   ; save the result

8. Performance

Processor Categories

Von Neumann - Code and data share memory.
- RISC - Reduced Instruction Set Computer - Small instruction set, fixed length opcodes, many registers (PowerPC).
- CISC - Complex ISC - many instructions, variable length, many addressing modes (x86).
Parallel Processing
- SIMD - Single Instruction Multiple Data.
- MIMD - Multiple Instruction Multiple Data.

Memory Performance: Cache and Paging

CPU performance is limited by slow memory, so data is copied from large, slow, cheap memory up to small, fast, expensive memory. Load a line/page at a time (assume sequential access). When full, replace least recently used (LRU) (assume time locality). Modified data must be written back (tracked by hardware dirty bit).

                                            Size                Replacement
                                    Line/page  Total  Speed     Algorithm
                                    ---------  -----  -----     -----------
           +-----------+
           |    CPU    |              4-16B     10²B  200 ps     Renaming (hardware)
           | Registers |  
           +-----------+
                 |
          +-------------+
          |   L1 Cache  |             32B       10⁴B  1 ns       4 way LRU (hardware)
          +-------------+
                 |
        +------------------+
        | L2 Cache (lines) |          32-64B    10⁶B  5 ns       4 way LRU (hardware)
        +------------------+
                 |
     +------------------------+
     |      RAM (pages)       |       4KB       10⁹B 50 ns       Full LRU (OS)
     +------------------------+
                 |
  +------------------------------+
  |       Disk (sectors)         |    1K-4KB    10¹¹B 10 ms      N/A
  +------------------------------+

Sequential access is faster than random. Example: clearing a 2-D array:

  // faster                        // slower (cache misses, disk thrashing)
  int a[N][N];                     int a[N][N];
  for (int i=0; i<N; ++i)          for (int i=0; i<N; ++i)
    for (int j=0; j<N; ++j)          for (int j=0; j<N; ++j)
      a[i][j]=0;                       a[j][i]=0;

CPU Performance: Parallelism

Pipelining - The CPU can start executing the next instruction before the current one has finished.
Superscalar architecture - There can be more than one pipeline executing instructions in parallel.
Instruction reordering, speculative execution - while waiting for a result, the next instruction can be executed.
- Instructions must be completed in order.
- Pending results are held in hidden, temporary registers.
- Mispredicted branches can invalidate pending results.
- Instructions can stall if waiting on another result.
Branch prediction - guesses branch direction based on history.

Write code to allow parallelism. Could you reorder the statements and get the same result?

  // Compute y = a*b*c*d
  // Slow (multiplication has a long latency but is pipelined)
  y = a*b;
  y = y*c;  // must wait for a*b to finish
  y = y*d;  // must wait for y*c to finish

  // Faster, allows parallelism
  tmp1 = a*b;
  tmp2 = c*d;       // computes in parallel with a*b
  y = tmp1 * tmp2;  // must wait for a*b and c*d to finish

Avoid unpredictable branches. In the simplest model, branches are predicted to go the same way as last time. Newer processors use smarter models.

  // search for x in array
  for (i=0; i<N; ++i)  // OK, loops branch the same way except the last iteration
    if (a[i] == x)     // OK, false every time but once
      break;           // OK, unconditional branch is always predicted
  
  // compute maximum
  int max(int a, int b) {
    if (a > b) // unpredictable branch is slow
      return a;
    else 
      return b;
  }

  // faster, avoids a comparison
  int max(int a, int b) {
    int cmp = (a - b) >> 31;  // all 1 bits or all 0 bits
    return (~cmp & a) | (cmp & b);
  }

9. Input/Output

In a multitasking operating system such as Windows or UNIX, all I/O is through the operating system. I/O instructions are privileged.

  // Example: copying a file using UNIX system calls (no error checks)
  int in = open("in.txt", O_RDONLY);  // file descriptors
  int out = open("out.txt", O_WRONLY);
  char buf[BUF_SIZE];  // e.g. 4K bytes
  while (true) {
    int bytesRead = read(in, buf, BUF_SIZE);
    if (bytesRead > 0)
      write(out, buf, bytesRead);
    else
      break;  // end of input file
  }

A multitasking operating system allows many processes (jobs, programs) to run apparently in parallel by giving a few milliseconds of CPU time to each one, round robin. A process waiting for input gives up the CPU immediately.


                        Input Requested
                   +-------------------------+
                   |               Timeout   |
                   |            +------------+
                   V            V            |
              +---------+   +-------+   +---------+
              | Blocked |-->| Ready |-->| Running |-->  Exit 
              +---------+   +-------+   +---------+
                        Input
                        Ready

Process state graph. Only one process can be running at a time. Processes ready to run wait in a queue.


                   Context Switch
   +-------------------------------------------+
   |                                           |
   |  ===+=======+=======+=======+=========+   |
   +-->  | Ready | Ready | Ready | Running |-->+
      ===+=======+=======+=======+=========+

Round Robin Process Queue

Each process is assigned one or more memory segments. Reading, writing, or executing outside these segments cause a run time error (segmentation fault).

The operating system runs in privileged mode, allowing it to:

Execute input and output instructions.
Access arbitrary memory (including memory mapped I/O).
Assign memory segments.
Halt the processor.

I/O Devices

Keyboard
Mouse
Display (monitor)
Printer
Disk
Network
etc...

I/O Methods

Direct - OS reads or writes using LOAD/STORE (memory mapped) or IN/OUT instruction.
Polled - OS checks periodically if input is available.
DMA - for high speed (disk)
- Application requests read/write to OS
- OS outputs request "read/write N bytes from disk sector X to memory Y"
- OS suspends process and gives CPU to another process
- Disk reads/writes directly to memory bypassing CPU
- Disk interrupts CPU when done
- OS responds by putting process in ready state

Interrupts

An interrupt acts like a CALL but is triggered by an external event. Interrupts are always handled in the operating system by privileged code.

Switch to priveliged mode
Look up jump address by interrupt number in the interrupt vector table (in privileged memory)
Save instruction pointer
Jump to interrupt service routine
Save other registers (on stack or in a process control block)
Handle interrupt...
Restore registers (possibly of another process)
Return from interrupt (a privileged instruction)
Switch back to user mode

Some types of interrupts

Reset (reboot)
I/O completed
Timer (context switch)
Memory access violation (segmentation fault)
Privileged instruction in user mode (general protection fault)
Other run time error (invalid instruction, division by zero)
Page fault (load virtual memory page from swap file on disk)
Software interrupt (normal call to operating system)
Breakpoint (debugger)
Single step (debugger)

Interrupts form a hierarchy. An interrupt cannot be interrupted by a lower priority interrupt. There are also (privileged) instructions to disable and enable interrupts.

External buses

An external bus has a specified hardware interface to allow I/O devices from different vendors to be connected. Examples: older to newer:

RS232 - point to point serial port to connect a modem, mouse, or dumb terminal at 300b/s to 56Kb/s.
SCSI - usually to connect a disk drive (hard, floppy, CD). Can be daisy chained (devices connected in series).
Parallel port - point to point, connects a PC usually to a printer.
ISA (Industry Standard Architecture) - Older PC bus (memory, video card, network card, parallel or serial port, modem, other peripherals). Has 32 address and 16 data lines.
PCI - Newer PC bus - 32 address/data lines are multiplexed.
USB - Universal serial bus - connects up to 127 devices at 480Mb/s, multi-hub, only 4 wires (data, data return, power, ground), can be hot connected (plugged in with power on). Data is sent in packets, supports isochronous (constant rate) communication.
FireWire (or IEEE 1394 or i.link) - serial, 3.2Gb/s, supports hubs, splitters, bridges, decentralized communication.