CSE3101 Fall 2004 Exam #3, open book, open notes.  Name __________________

1.  Under what conditions is MOVNTQ faster than MOVQ? (10 pts)

  ANSWER: MOVNTQ is a non-temporal store: it writes to memory without
  bringing the destination into the cache.  It is faster when the
  destination is not already cached and the data would be evicted before
  it is read again, such as when copying an array larger than the cache.

2.  What is the difference between throughput and latency? (10 pts)

  ANSWER: Throughput (strictly, reciprocal throughput) is the minimum time
  between the starts of successive independent instructions, which may
  execute in parallel.  Latency, which is at least as long, is the time
  from the start of execution until the result is available.
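
  As a rough illustration of the distinction, here is a minimal C sketch of
  an idealized pipeline.  The function names and the example figures
  (latency 3, reciprocal throughput 1) are assumptions chosen for
  illustration, not timings of any particular processor.

```c
#include <assert.h>

/* Idealized cycle count for n independent instructions: a new one starts
   every recip_throughput cycles, and the last one finishes one full
   latency after it starts. */
unsigned long cycles_independent(unsigned long n, unsigned latency,
                                 unsigned recip_throughput)
{
    assert(n >= 1);
    return (n - 1) * recip_throughput + latency;
}

/* Idealized cycle count for a chain of n instructions in which each one
   needs the previous result: nothing overlaps, so latency dominates. */
unsigned long cycles_dependent(unsigned long n, unsigned latency)
{
    assert(n >= 1);
    return n * latency;
}
```

  With these assumed figures, cycles_independent(100, 3, 1) is 102 while
  cycles_dependent(100, 3) is 300.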

3.  When should EMMS be used? (5 pts)

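  ANSWER: After the last MMX instruction and before any following FPU
  instruction.  The MMX registers share the FPU registers, and EMMS marks
  them empty so the FPU can use them again.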

4.  The following code reads an array of ECX signed words pointed to
by ESI, replacing negative values with 0.  Which of the two conditional
jumps is slower?  Rewrite it using a conditional move instead (15 pts).

      mov bx, 0
  L1: mov ax, [esi+ecx*2-2]
      cmp ax, bx
      jg L2
      xor eax, eax
  L2: mov [esi+ecx*2-2], ax
      sub ecx, 1
      jg L1


  ; ANSWER: jg L2 is slower because it depends on the data and is hard to
  ; predict; jg L1 is taken on every iteration except the last.
      mov bx, 0
  L1: mov ax, [esi+ecx*2-2]
      cmp ax, bx
      cmovle ax, bx     ; NOT cmovle ax, 0 (CMOVcc takes no immediate operand)
      mov [esi+ecx*2-2], ax
      sub ecx, 1
      jg L1
      

5.  Rewrite the code using MMX to perform this operation on 4 elements
at a time.  Assume ECX is a multiple of 4 and ESI is aligned on an
8 byte boundary.  (20 pts).

  ; ANSWER (others are possible)
  L1: movq mm0, [esi+ecx*2-8]   ; NOT -2
      pxor mm1, mm1             ; All 0
      pcmpgtw mm1, mm0          ; -1 in words where mm0 is negative, else 0
      pandn mm1, mm0            ; (NOT mm1) AND mm0: negative words become 0, others kept
      movq [esi+ecx*2-8], mm1
      sub ecx, 4
      jg L1

  ; The second and third instructions could also be:
      movq mm1, mm0
      psraw mm1, 15             ; -1 in words where mm0 is negative, else 0
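
  The mask technique above can be checked against a portable scalar model.
  The following is a minimal C sketch, not MMX code; the function names
  clamp_lane and clamp_negative_words are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One 16-bit lane: build the all-ones/all-zeros mask that pcmpgtw (against
   zero) or psraw 15 would produce, then apply pandn. */
static uint16_t clamp_lane(uint16_t x)
{
    uint16_t mask = (x & 0x8000u) ? 0xFFFFu : 0x0000u; /* -1 if negative */
    return (uint16_t)(~mask & x);                      /* (NOT mask) AND x */
}

/* Process the array four words at a time, as one MMX register would.
   Like the exam question, this assumes count is a multiple of 4. */
void clamp_negative_words(int16_t *words, size_t count)
{
    assert(count % 4 == 0);
    for (size_t i = 0; i < count; i += 4)
        for (size_t lane = 0; lane < 4; lane++)
            words[i + lane] =
                (int16_t)clamp_lane((uint16_t)words[i + lane]);
}
```

  Unlike the assembly loop, this sketch walks the array from the start
  rather than from the end; the result is the same.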

6.  MM0 contains 8 unsigned bytes.  Replace MM0 with 8 copies of the sum
of these bytes, or all 1 bits if the sum is greater than 255. (20 pts).

  ; ANSWER (others are possible):       let mm0 = a, b, c, d, e, f, g, h (high byte first)
  pshufw mm1, mm0, 01001110b    ; rotate 32 bits: e, f, g, h, a, b, c, d
  paddusb mm0, mm1              ; a+e, b+f, c+g, d+h, a+e, b+f, c+g, d+h
  pshufw mm1, mm0, 10010011b    ; rotate 16 bits
  paddusb mm0, mm1              ; x, y, x, y, x, y, x, y where x=a+c+e+g, y=b+d+f+h
  movq mm1, mm0
  psllq mm1, 8                  ; y, x, y, x, y, x, y, 0
  paddusb mm0, mm1              ; s, s, s, s, s, s, s, y where s=x+y
  pshufw mm0, mm0, 0ffh         ; s, s, s, s, s, s, s, s (copy high word 4 times)
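
  The sequence above can be traced with a portable scalar model of the three
  instructions it uses.  This is a minimal C sketch in which a uint64_t
  stands in for an MMX register; the helper functions are named after the
  instructions they imitate and sum_bytes_saturated is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* paddusb: add 8 unsigned byte lanes, saturating each at 255. */
static uint64_t paddusb(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        unsigned s = (unsigned)((a >> (8 * i)) & 0xFF)
                   + (unsigned)((b >> (8 * i)) & 0xFF);
        if (s > 255)
            s = 255;
        r |= (uint64_t)s << (8 * i);
    }
    return r;
}

/* pshufw: result word i is the source word selected by imm bits 2i+1..2i. */
static uint64_t pshufw(uint64_t src, unsigned imm)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        unsigned sel = (imm >> (2 * i)) & 3;
        r |= ((src >> (16 * sel)) & 0xFFFF) << (16 * i);
    }
    return r;
}

/* psllq: shift the whole 64-bit value left (model limited to count < 64). */
static uint64_t psllq(uint64_t v, unsigned count)
{
    assert(count < 64);
    return v << count;
}

/* The same steps as the answer, one line per instruction. */
uint64_t sum_bytes_saturated(uint64_t mm0)
{
    uint64_t mm1;
    mm1 = pshufw(mm0, 0x4E);  /* 01001110b: swap the two dwords */
    mm0 = paddusb(mm0, mm1);
    mm1 = pshufw(mm0, 0x93);  /* 10010011b: rotate by one word */
    mm0 = paddusb(mm0, mm1);
    mm1 = psllq(mm0, 8);
    mm0 = paddusb(mm0, mm1);
    return pshufw(mm0, 0xFF); /* copy the high word 4 times */
}
```

  For example, the bytes 1 through 8 sum to 36 (24 hex), so an input of
  0807060504030201 hex gives 2424242424242424 hex.  Because every add
  saturates, a total above 255 leaves ff in every byte.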


7.  Show the contents of MM0 after each instruction (5 pts each).

                        ; ANSWER (shown as 4 words in hex, low word first)

  pcmpeqb mm0, mm0      ; ffff ffff ffff ffff  (regardless of previous value)

  psrlw mm0, 13         ; 0007 0007 0007 0007  (each of the 4 words shifted right 13 bits)

  pmaddwd mm0, mm0      ; 0062 0000 0062 0000  (2 dwords containing 98 decimal = 7*7 + 7*7)

  pshufw mm0, mm0, 0    ; 0062 0062 0062 0062  (4 copies of low word)
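
  These four results can be reproduced with a portable scalar model.  This
  is a minimal C sketch in which a uint64_t stands in for MM0; the helpers
  are named after the instructions they imitate, and word_at and q7_trace
  are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Signed value of 16-bit word number index (0 = low word). */
static int32_t word_at(uint64_t v, int index)
{
    int32_t w = (int32_t)((v >> (16 * index)) & 0xFFFF);
    return w >= 0x8000 ? w - 0x10000 : w;
}

/* pcmpeqb: ff in each byte lane where a and b are equal, else 00. */
static uint64_t pcmpeqb(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 8; i++)
        if (((a >> (8 * i)) & 0xFF) == ((b >> (8 * i)) & 0xFF))
            r |= (uint64_t)0xFF << (8 * i);
    return r;
}

/* psrlw: logical right shift of each word (model limited to count < 16). */
static uint64_t psrlw(uint64_t v, unsigned count)
{
    assert(count < 16);
    uint64_t r = 0;
    for (int i = 0; i < 4; i++)
        r |= (((v >> (16 * i)) & 0xFFFF) >> count) << (16 * i);
    return r;
}

/* pmaddwd: multiply signed words pairwise, add adjacent products
   into each dword (the sum wraps at 32 bits). */
static uint64_t pmaddwd(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 2; i++) {
        int32_t lo = word_at(a, 2 * i) * word_at(b, 2 * i);
        int32_t hi = word_at(a, 2 * i + 1) * word_at(b, 2 * i + 1);
        uint32_t sum = (uint32_t)lo + (uint32_t)hi;
        r |= (uint64_t)sum << (32 * i);
    }
    return r;
}

/* pshufw: result word i is the source word selected by imm bits 2i+1..2i. */
static uint64_t pshufw(uint64_t src, unsigned imm)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++)
        r |= ((src >> (16 * ((imm >> (2 * i)) & 3))) & 0xFFFF) << (16 * i);
    return r;
}

/* Record MM0 after each of the four instructions in the question. */
void q7_trace(uint64_t mm0, uint64_t out[4])
{
    mm0 = pcmpeqb(mm0, mm0);  out[0] = mm0;
    mm0 = psrlw(mm0, 13);     out[1] = mm0;
    mm0 = pmaddwd(mm0, mm0);  out[2] = mm0;
    mm0 = pshufw(mm0, 0);     out[3] = mm0;
}
```

  Written as single 64-bit hex values (high word first, the reverse of the
  listing above), the four results are ffffffffffffffff, 0007000700070007,
  0000006200000062 and 0062006200620062 for any starting value.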