CIS5220 Computer Organization Notes

1. Introduction

Textbook: The Architecture of Computer Hardware and Systems Software by Irv Englander.

2. Number Systems

Numbers are usually written in base 10. The digits of a number written in base B have weights:

  ... B3 B2 B1 B0 . B-1 B-2 B-3 ...
For example, in base 8:
  123.458 = 1x82 + 2x81 + 3x80 + 4x8-1 + 5x8-2
  = 1 x 64 + 2 x 8 + 3 x 1 + 4 x 1/8 + 5 x 1/64
  = 64 + 16 + 3 + 4/8 + 5/64
  = 83 37/64.
Important bases in computers are base 2 (binary) and base 16 (hexadecimal or hex). In hex, A=10, B=11, C=12, D=13, E=14, F=15. Examples:
  1AF16  (sometimes written 0x1AF or 1AFh) 
  = 1 x 162 + 10 x 161 + 15 x 160
  = 1 x 256 + 10 x 16 + 15 
  = 431.

  1101011112
  = 1 x 28 + 1 x 27 + 0 x 26 + 1 x 25 + 0 x 24 + 1 x 23 + 1 x 22 + 1 x 21 + 1 x 20
  = 256 + 128 + 32 + 8 + 4 + 2 + 1
  = 431
Each hex digit converts to 4 binary digits (4 bits).
  0001 1010 11112 = 1AF16 = 431.

3. Data Formats

All data in computers are stored as sequences of bits. Important units:
  8 bits = 1 byte.
  1024 bytes = 1 kilobyte = 1 KB = 210 bytes.
  1024 KB = 1 megabyte = 1 MB = 220 bytes = 1,048,576 bytes.
  1024 MB = 1 gigabyte 1 GB = 230 bytes.
  1024 GB = 1 terabyte 1 TB = 240 bytes.
File, disk, and memory sizes are expressed in bytes. However, network and modem speeds are often expressed in bits per second with a lowercase "b", e.g. 56 Kb/s = 56 kilobits per second = 7 KB per second.

Text

Plain text uses the ASCII character set. Each byte (as a binary number) represents one character, coded as follows:
  10 = linefeed (end of line)
  13 = carriage return (preceeds linefeed in Windows, but not UNIX)
  32 = space
  48..57 = '0'..'9'
  65..90 = 'A'..'Z'
  97..122 = 'a'..'z'
  Complete ASCII table (0 to 127)
  Multi-byte Unicode extensions for other languages.

Other data formats

4. Integers

Integer types have fixed sizes, may be signed or unsigned. Signed numbers are binary 2's compliment: weight of first bit is negative. Example: 4 bit 2's compliment:
  11012
  = 1 x -23 + 1 x 22 + 0 x 21 + 1 x 20
  = -8 + 4 + 0 + 1
  = -3
Common data types.
  Size      C/C++              Java             Range
  8 bits    char               byte         -128 to 127         
  8 bits    unsigned char      -               0 to 255
  16 bits   short              short      -32768 to 32767
  16 bits   unsigned short     -               0 to 65535
  32 bits   int                int   -2147483648 to 2147383647
  32 bits   unsigned int       -               0 to 4294967295
  64 bits   long long          long    -9.223e18 to 9.223e18
  64 bits   unsigned long long -               0 to 1.844e19
Arithmetic producing a result outside the legal range "overflows" by dropping the leading bits. For n bit types the computed result is congruent modulo 2n to the true result, e.g. for 32 bit integers, add or subtract 232 = 4294967296 until the result is within range:
  1000000000 + 2000000000 == -1294967296  (3000000000 - 232)
  0x20000 * 0x30000 == 0x0  (last 8 hex digits of 0x600000000)
Integer division rounds toward zero.
   8/3 == 2
  -8/3 == -2

5. Real Numbers

IEEE-754 real numbers are approximated by 3 binary numbers: sign x mantissa x 2exponent. Sign is stored as 0 for positive or 1 for negative. n-bit exponent is stored as an unsigned number after adding 2n-1-1 (excess 127 or 1023 format). Mantissa is in range 1.000...2 to 1.111...2 but only the bits after the point are stored. As a special case, 0.0 is stored as all 0 bits.
  Size     C/C++/Java type  Sign   Exponent  Mantissa  Range     Precision
  32 bits  float            1 bit  8 bits    23 bits   +-1038     6 significant digits
  64 bits  double           1 bit  11 bits   52 bits   +-10308    14 significant digits

  Decimal  float                                double        (52 bits...)
  0.0      0 00000000 00000000000000000000000   0 00000000000 00000000...0
  1.0      0 01111111 00000000000000000000000   0 01111111111 00000000...0
  2.0      0 10000000 00000000000000000000000   0 10000000000 00000000...0
  3.0      0 10000000 10000000000000000000000   0 10000000000 10000000...0
  4.0      0 10000001 00000000000000000000000   0 10000000001 00000000...0
  5.0      0 10000001 01000000000000000000000   0 10000000001 01000000...0
 -5.0      1 10000001 01000000000000000000000   1 10000000001 01000000...0
  0.5      0 01111110 00000000000000000000000   0 01111111110 00000000...0
  0.25     0 01111101 00000000000000000000000   0 01111111101 00000000...0
  0.1      0 01111011 10011001100110011001101   0 01111111011 10011001...1
Floating point arithmetic is not always exact.
  10 * 0.1 - 1 != 0           (0.1 does not have an exact representation)
  sqrt(2) * sqrt(2) - 2 != 0  (sqrt(2) does not have an exact representation)

6. Little Man Computer

Flash animation (instruction set differs from book).
    000 = halt (until user presses reset)
    1xx = add number from location xx to calculator
    2xx = subtract number from location xx
    3xx = store calculator contents at location xx
    5xx = load contents of location xx into calculator
    6xx = enter xx into program counter (branch unconditional)
    7xx = enter xx into program counter if calculator holds 0 (branch if 0)
    8xx = enter xx into program counter if calculator holds 0 or a positive number    
    901 = read input, put into calculator
    902 = output value in calculator
After all instructions except branches (6xx, 7xx, 8xx) and halt (000) the little man adds 1 to the program counter. Example program to input 2 numbers, add them, output the sum:
  00 901  (input)
  01 399  (store at 99)
  02 901  (input)
  03 199  (add from 99)
  04 902  (output)
  05 000  (halt)

7. Computer Components

  Registers (in CPU)                Typical number  Purpose
    Integer                                8-32     Arithmetic, comparison, logical
    Floating point                         8-16     Arithmetic
    Flags                                  4-16     Comparison result
    Stack pointer                          1        Call, return
    Instruction pointer                    1        Fetch
    Segment                                4-8      Process protection

  RAM     Random Access Memory      Typical size in bytes
    DRAM    Dynamic RAM                    108       Main memory
    SRAM    Static RAM                     106       Cache
  ROM     Read-Only Memory                 105       Boot code
    PROM    Programmable (one time)
    EPROM   Erasable PROM (in UV light)
    EEPROM (flash) electrically EPROM      103       Infrequently written data
  Disk                                     1011       Files
CPU talks to RAM, ROM, disk, and I/O on a bus: Direct Memory Access (DMA) - Disk reads or writes memory (bypassing CPU) and interrupts CPU when done.

Instruction Set

  Move           =
  Arithmetic     + - * /
  Comparison     == != < > <= >=
  Logical        & | ^ ~  (and, or, xor, not)
  Shift          <<  >>  (shift left, right), rotate
  Branch         goto, if (...) goto
  Push, Pop      put on stack, remove
  Call, Return   push/pop instruction pointer
  Vector         vector arithmetic (MMX, SSE on x86)

Examples (x86)

Moves are to or from registers, not memory to memory.
  ; a = b
  mov eax, [b]  ; move from variable b to register eax
  mov [a], eax  ; move from register eax to variable a
There is no if, while, for, just compare and conditional branch.
  ; if (a == b) a = a + 1;
  mov eax, [a]
  cmp eax, [b]  ; compare b with a, save result in flags
  jne L1        ; jump if not equal
  add eax, 1    ; add 1 to copy of a
  mov [a], eax  ; save result
L1:             ; label to jump to
Logical and shift operators are faster than multiplication and division, but only work for powers of 2.
  x << 3;  // x * 8 (double x 3 times)
  x >> 3;  // x / 8 (halve x 3 times, rounding down)
  x & 7    // x % 8 (AND x with 000001112)
The stack is used both for passing parameters to functions and saving the return address between call and return. On the x86 the stack pointer is called esp. The stack grows downward by 4 bytes for each push.
  ; x = sum(1, 2);
  push 2        ; by convention, push right to left
  push 1
  call sum      ; push instruction pointer and jump to sum:
  mov [x], eax  ; result is returned in eax
  add esp, 8    ; the caller discards the arguments

  |       |  
  +-------+ Picture of stack in sum()
  | b = 2 |
  +-------+ esp+8
  | a = 1 |
  +-------+ esp+4
  | eip   |   <-- address of the instruction following "call"
  +-------+ <--- esp

  ; int sum(int a, int b) {return a+b;}
sum:
  mov eax, [esp+4]  ; a is 4 bytes above where esp points
  add eax, [esp+8]  ; b
  ret               ; pop instruction pointer (eip)

The Pentium MMX (1997) introduced 8 64-bit registers that could operate on packed 8, 16, or 32 bit integers 64 bits at a time. The Pentium 4 (2000) added 8 128-bit registers that could operate on 4 floats or 2 doubles at a time.
  ; // add two 4-element vectors of 16-bit integers using MMX
  ; short a[4], b[4];
  ; for (i=0; i<4; ++i)
  ;   a[i] = a[i] + b[i]

  movq mm0, [a]   ; load a[0..3] into MMX register mm0
  movq mm1, [b]   ; load b[0..3]
  paddw mm0, mm1  ; do 4 additions at once, result in mm0
  movq [a], mm0   ; save the result

8. Performance

Processor Categories

Memory Performance: Cache and Paging

CPU performance is limited by slow memory, so data is copied from large, slow, cheap memory up to small, fast, expensive memory. Load a line/page at a time (assume sequential access). When full, replace least recently used (LRU) (assume time locality). Modified data must be written back (tracked by hardware dirty bit).
                                            Size                Replacement
                                    Line/page  Total  Speed     Algorithm
                                    ---------  -----  -----     -----------
           +-----------+
           |    CPU    |              4-16B     102B  200 ps     Renaming (hardware)
           | Registers |  
           +-----------+
                 |
          +-------------+
          |   L1 Cache  |             32B       104B  1 ns       4 way LRU (hardware)
          +-------------+
                 |
        +------------------+
        | L2 Cache (lines) |          32-64B    106B  5 ns       4 way LRU (hardware)
        +------------------+
                 |
     +------------------------+
     |      RAM (pages)       |       4KB       109B 50 ns       Full LRU (OS)
     +------------------------+
                 |
  +------------------------------+
  |       Disk (sectors)         |    1K-4KB    1011B 10 ms      N/A
  +------------------------------+

Sequential access is faster than random. Example: clearing a 2-D array:
  // faster                        // slower (cache misses, disk thrashing)
  int a[N][N];                     int a[N][N];
  for (int i=0; i<N; ++i)          for (int i=0; i<N; ++i)
    for (int j=0; j<N; ++j)          for (int j=0; j<N; ++j)
      a[i][j]=0;                       a[j][i]=0;

CPU Performance: Parallelism

Write code to allow parallelism. Could you reorder the statements and get the same result?
  // Compute y = a*b*c*d
  // Slow (multiplication has a long latency but is pipelined)
  y = a*b;
  y = y*c;  // must wait for a*b to finish
  y = y*d;  // must wait for y*c to finish

  // Faster, allows parallelism
  tmp1 = a*b;
  tmp2 = c*d;       // computes in parallel with a*b
  y = tmp1 * tmp2;  // must wait for a*b and c*d to finish
Avoid unpredictable branches. In the simplest model, branches are predicted to go the same way as last time. Newer processors use smarter models.
  // search for x in array
  for (i=0; i<N; ++i)  // OK, loops branch the same way except the last iteration
    if (a[i] == x)     // OK, false every time but once
      break;           // OK, unconditional branch is always predicted
  
  // compute maximum
  int max(int a, int b) {
    if (a > b) // unpredictable branch is slow
      return a;
    else 
      return b;
  }

  // faster, avoids a comparison
  int max(int a, int b) {
    int cmp = (a - b) >> 31;  // all 1 bits or all 0 bits
    return (~cmp & a) | (cmp & b);
  }

9. Input/Output

In a multitasking operating system such as Windows or UNIX, all I/O is through the operating system. I/O instructions are privileged.
  // Example: copying a file using UNIX system calls (no error checks)
  int in = open("in.txt", O_RDONLY);  // file descriptors
  int out = open("out.txt", O_WRONLY);
  char buf[BUF_SIZE];  // e.g. 4K bytes
  while (true) {
    int bytesRead = read(in, buf, BUF_SIZE);
    if (bytesRead > 0)
      write(out, buf, bytesRead);
    else
      break;  // end of input file
  }

A multitasking operating system allows many processes (jobs, programs) to run apparently in parallel by giving a few milliseconds of CPU time to each one, round robin. A process waiting for input gives up the CPU immediately.


                        Input Requested
                   +-------------------------+
                   |               Timeout   |
                   |            +------------+
                   V            V            |
              +---------+   +-------+   +---------+
              | Blocked |-->| Ready |-->| Running |-->  Exit 
              +---------+   +-------+   +---------+
                        Input
                        Ready


                   Context Switch
   +-------------------------------------------+
   |                                           |
   |  ===+=======+=======+=======+=========+   |
   +-->  | Ready | Ready | Ready | Running |-->+
      ===+=======+=======+=======+=========+

Each process is assigned one or more memory segments. Reading, writing, or executing outside these segments cause a run time error (segmentation fault).

The operating system runs in privileged mode, allowing it to:

I/O Devices

I/O Methods

Interrupts

An interrupt acts like a CALL but is triggered by an external event. Interrupts are always handled in the operating system by privileged code.

Some types of interrupts

Interrupts form a hierarchy. An interrupt cannot be interrupted by a lower priority interrupt. There are also (privileged) instructions to disable and enable interrupts.

External buses

An external bus has a specified hardware interface to allow I/O devices from different vendors to be connected. Examples: older to newer:

10. Peripherals

11. Networks

12. Three Examples