CSE3101 Fall 2005 Final Exam, open book, open notes.

Name _________________

1. When is CMOVE faster than JNE? (5 pts)

ANSWER: when the result of the previous comparison is unpredictable (random, does not follow a regular pattern), so that JNE would often be mispredicted.

2. When is MOVQ faster than MOVNTQ? (5 pts)

ANSWER: when writing to memory that was recently accessed (and so is already in cache), or that will be accessed again in the near future. MOVNTQ bypasses the cache.

3. Write the following in assembler using only 32-bit registers. (long long is a signed 64-bit integer in little-endian format.) (30 pts)

    void muladd(int a, int b, long long *sum) {
      *sum += (long long)a * b;
    }

    ; ANSWER
    _muladd:
      mov eax, [esp+4]          ; a
      imul dword ptr [esp+8]    ; edx:eax = a*b, signed
      mov ecx, [esp+12]         ; sum (pointer)
      add [ecx], eax            ; add low 32 bits
      adc [ecx+4], edx          ; add high 32 bits with carry
      ret

4. Suppose a and b are doubles, i and j are ints. Declare these variables in assembler and compute a = sqrt((a-i) * (a-j)) / (b+3.0); (30 pts)

    ; ANSWER
    .data
    a     real8 ?       ; or a dq ?
    b     real8 ?
    i     dword ?       ; or i dd ?
    j     dword ?
    three real8 3.0
    .code
      fld a             ; or fld [a] or fld real8 ptr [a]
      fisub i           ; st0 = a-i
      fld a
      fisub j           ; st0 = a-j, st1 = a-i
      fmul              ; st0 = (a-i)*(a-j), pops
      fsqrt
      fld b
      fadd three        ; st0 = b+3.0, st1 = sqrt(...)
      fdiv              ; st0 = sqrt(...) / (b+3.0), pops
      fstp a

5. Write the following function using MMX, looping at most n iterations.
Assume n > 0 and sum < 65536. (30 pts)

    int f(unsigned char *a, unsigned char *b, int n) {
      int i, sum = 0;
      for (i=0; i<n*8; i++)
        if (a[i] > b[i])
          sum += a[i] - b[i];
      return sum;
    }

    ; ANSWER
    _f:
      mov eax, [esp+4]            ; a
      mov edx, [esp+8]            ; b
      mov ecx, [esp+12]           ; n, counts down to 1
      pxor mm0, mm0               ; constant zero
      pxor mm1, mm1               ; 4 16-bit partial sums
    f1:
      movq mm2, [eax+ecx*8-8]     ; 8 elements of a
      psubusb mm2, [edx+ecx*8-8]  ; 8 elements of max(a-b, 0)
      movq mm3, mm2               ; zero extend mm2 to 8 16-bit words in mm3:mm2
      punpcklbw mm2, mm0
      punpckhbw mm3, mm0
      paddw mm1, mm2              ; accumulate 4 partial sums in mm1
      paddw mm1, mm3
      sub ecx, 1
      jnz f1

      ; add up the 4 partial sums in mm1 (many ways to do this)
      movq mm2, mm1
      psrlq mm2, 32
      paddw mm1, mm2              ; 2 16-bit half-sums in low half of mm1
      punpcklwd mm1, mm0          ; zero extend to 32 bits
      movd ecx, mm1               ; now add the two 32-bit halves and put in eax
      psrlq mm1, 32
      movd eax, mm1
      add eax, ecx
      emms
      ret
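
The answer to question 5 can be sanity-checked against a portable C model. The sketch below is not part of the exam answer. The names sat_sub_u8, f_ref and f_model are hypothetical, and it assumes the loop covers 8*n bytes, as the MMX code does. f_model mimics psubusb with a saturating byte subtract and keeps four 16-bit lane sums in the role of mm1.

```c
#include <stdint.h>

/* Saturating unsigned byte subtract: what psubusb does in each byte lane. */
static uint8_t sat_sub_u8(uint8_t x, uint8_t y) {
    return (uint8_t)(x > y ? x - y : 0);
}

/* Scalar reference over 8*n bytes (assumed loop bound). */
int f_ref(const unsigned char *a, const unsigned char *b, int n) {
    int i, sum = 0;
    for (i = 0; i < n * 8; i++)
        if (a[i] > b[i])
            sum += a[i] - b[i];
    return sum;
}

/* Model of the MMX loop: 8 bytes per iteration, four 16-bit partial sums. */
int f_model(const unsigned char *a, const unsigned char *b, int n) {
    uint16_t lane[4] = {0, 0, 0, 0};      /* plays the role of mm1 */
    uint16_t h0, h1;
    int k, j;

    for (k = n; k >= 1; k--) {            /* ecx counts down from n to 1 */
        const unsigned char *pa = a + k * 8 - 8;
        const unsigned char *pb = b + k * 8 - 8;
        /* After punpcklbw/punpckhbw, byte j is added into word lane j % 4. */
        for (j = 0; j < 8; j++)
            lane[j % 4] = (uint16_t)(lane[j % 4] + sat_sub_u8(pa[j], pb[j]));
    }

    /* psrlq 32 + paddw: two 16-bit half-sums that wrap modulo 65536. */
    h0 = (uint16_t)(lane[0] + lane[2]);
    h1 = (uint16_t)(lane[1] + lane[3]);

    /* punpcklwd + add: the final addition is done in 32 bits. */
    return (int)h0 + (int)h1;
}
```

Because each lane is only 16 bits wide, the model is guaranteed to agree with the reference only when the total stays below 65536, which is why the question states that assumption.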