MMX is widely supported: Intel Pentium MMX, Pentium 2+, Celeron, AMD K6, Athlon, Duron, and higher ([PENT,MMX] in NASM docs).
Extended MMX instructions are supported by newer processors: Pentium 3+, Celeron 2+, Athlon and Duron ([KATMAI,MMX] in NASM docs).
PADDUSW mm1, mm2
| | | |  |    |
| | | |  |    +-- Source is an MMX register or 64 bit memory location
| | | |  +------- Destination is an MMX register, mm0-mm7
| | | +---------- B, W, D, Q = 8 bytes, 4 words, 2 dwords, or 1 qword
| | +------------ S = saturation, US = unsigned saturation
| +-------------- Instruction (add)
+---------------- P = packed (all MMX instructions)
Saturation means that overflows are handled by replacing the result with the closest representable value: results are clamped to 0 to 255 for unsigned byte saturation, -32768 to 32767 for signed word saturation, etc.
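The clamping rule can be sketched in portable C. This is only a scalar model of what PADDUSB and PADDSW compute, not MMX code; the function names paddusb_model and paddsw_model are made up for this illustration, and a 64 bit integer stands in for an MMX register.

```c
#include <stdint.h>

/* Scalar model of PADDUSB (hypothetical helper name):
   add eight unsigned bytes, clamping each sum to 255. */
uint64_t paddusb_model(uint64_t a, uint64_t b)
{
    uint64_t result = 0;
    for (int i = 0; i < 8; i++) {
        unsigned x = (unsigned)((a >> (8 * i)) & 0xff);
        unsigned y = (unsigned)((b >> (8 * i)) & 0xff);
        unsigned sum = x + y;
        if (sum > 255)
            sum = 255;                  /* saturate instead of wrapping */
        result |= (uint64_t)sum << (8 * i);
    }
    return result;
}

/* Scalar model of PADDSW (hypothetical helper name):
   add four signed words, clamping each sum to -32768..32767. */
uint64_t paddsw_model(uint64_t a, uint64_t b)
{
    uint64_t result = 0;
    for (int i = 0; i < 4; i++) {
        int32_t x = (int32_t)((a >> (16 * i)) & 0xffff);
        int32_t y = (int32_t)((b >> (16 * i)) & 0xffff);
        if (x >= 0x8000) x -= 0x10000;  /* sign-extend 16 bits */
        if (y >= 0x8000) y -= 0x10000;
        int32_t sum = x + y;
        if (sum > 32767)  sum = 32767;  /* clamp positive overflow */
        if (sum < -32768) sum = -32768; /* clamp negative overflow */
        result |= (uint64_t)(uint16_t)sum << (16 * i);
    }
    return result;
}
```

For example, paddusb_model(0xF0, 0x20) gives 0xFF in the low byte, where a wrapping PADDB would give 0x10.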
Unless noted, all instructions take two operands, a and b. a is mm0-mm7, b is mm0-mm7 or a 64 bit memory location.

a[0], b[0] means the low order byte, word, or dword of a or b.
x means B, W, D, or Q, indicating 8, 16, 32, or 64 bit size of input elements.
r means eax, ebx, ecx, edx, esi, edi, esp, ebp.
mm means one of mm0-mm7.
mem means memory in any addressing mode (e.g. qword ptr [eax+ebx*8+offset]).
i means an immediate constant (0, 1, 2...).
* means extended MMX, not supported on some older processors.

Instruction         Input sizes  Notes
-----------         -----------  -----
; Required FPU reset after any MMX instruction before FPU instructions
EMMS                             No operands

; Move
MOVQ                Q            a = b, a may be memory if b is mm0-7
MOVNTQ mem, mm *    Q            a = b, fast (non temporal) store
MOVD a, mm          D            a = mm[0], a is a 32 bit register or memory
MOVD mm, b          D to Q       mm = {b, 0}, b is a 32 bit register or memory

; Parallel arithmetic
PADDx               B W D        a += b, discard carry
PADDSx              B W          a += b, signed saturation
PADDUSx             B W          a += b, unsigned saturation
PSUBx               B W D        a -= b, discard borrow
PSUBSx              B W          a -= b, signed saturation
PSUBUSx             B W          a -= b, unsigned saturation
PMULLW              W            a = (a * b) & 0xffff
PMULHW              W            a = (a * b) >> 16, signed
PMULHUW *           W            a = (a * b) >> 16, unsigned

; Complex multiply or vector product
PMADDWD             W to D       a = {a[0]*b[0]+a[1]*b[1], a[2]*b[2]+a[3]*b[3]}, signed

; Parallel compare, conditional store
PCMPEQx             B W D        a = -(a == b)
PCMPGTx             B W D        a = -(a > b), signed
PMOVMSKB r, mm *    B            r bit i = (mm[i] < 0), 8 bits, zero extended
MASKMOVQ *          B            if (b[i] < 0) [edi+i] = a[i], b must be mm0-7

; Logical (element boundaries are irrelevant)
PAND                Q            a &= b
PANDN               Q            a = ~a & b
POR                 Q            a |= b
PXOR                Q            a ^= b

; Shift
PSLLx               W D Q        a <<= b[0], b may also be an immediate, 0..63
PSRLx               W D Q        a >>= b[0], unsigned, b may also be an immediate, 0..63
PSRAx               W D          a >>= b[0], signed, b may also be an immediate, 0..31

; Pack to smaller type
PACKSSWB            W to B       a = {a[0]..a[3], b[0]..b[3]}, signed saturation
PACKUSWB            W to B       a = {a[0]..a[3], b[0]..b[3]}, unsigned saturation
PACKSSDW            D to W       a = {a[0], a[1], b[0], b[1]}, signed saturation

; Unpack
PUNPCKLBW           B            a = {a[0],b[0]...a[3],b[3]}
PUNPCKHBW           B            a = {a[4],b[4]...a[7],b[7]}
PUNPCKLWD           W            a = {a[0],b[0],a[1],b[1]}
PUNPCKHWD           W            a = {a[2],b[2],a[3],b[3]}
PUNPCKLDQ           D            a = {a[0],b[0]}
PUNPCKHDQ           D            a = {a[1],b[1]}

; Parallel min, max, average
PMINUB *            B            a = min(a, b), unsigned
PMAXUB *            B            a = max(a, b), unsigned
PMINSW *            W            a = min(a, b), signed
PMAXSW *            W            a = max(a, b), signed
PAVGx *             B W          a = (a + b + 1) >> 1, unsigned

; Reorder elements
PSHUFW a, b, i *    W            a = {b[i&3], b[(i>>2)&3], b[(i>>4)&3], b[(i>>6)&3]}

; Extract or insert single element
PEXTRW r, mm, i *   W            r = mm[i], i=0..3, zero extend
PINSRW mm, b, i *   W            mm[i] = b, i=0..3, b is 16/32 bit reg or memory

; Vector sum of absolute differences
PSADBW *            B to W       a[0] = sum(abs(a-b)), unsigned, zero extended

MOVNTQ and MASKMOVQ store non-temporal data: data that will not be reloaded for a while and therefore should not be kept in the cache. Bypassing the cache leaves it free for other data.
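The element notation used in the table can be checked against a plain C model. The sketch below models two of the less obvious entries, PSHUFW and PMADDWD, on a 64 bit integer standing in for an MMX register. It is not MMX code, and the names get_word, to_signed16, pshufw_model, and pmaddwd_model are hypothetical helpers introduced only for this example. In PSHUFW, each two bit field of the immediate selects one source word.

```c
#include <stdint.h>

/* Word k (0..3) of a packed 64 bit value, as an unsigned 16 bit number. */
static uint16_t get_word(uint64_t v, int k)
{
    return (uint16_t)(v >> (16 * k));
}

/* Reinterpret a 16 bit pattern as a signed value in -32768..32767. */
static int32_t to_signed16(uint16_t w)
{
    return w >= 0x8000 ? (int32_t)w - 0x10000 : (int32_t)w;
}

/* Scalar model of PSHUFW a, b, i:
   result word k is source word ((i >> 2k) & 3). */
uint64_t pshufw_model(uint64_t b, unsigned imm)
{
    uint64_t a = 0;
    for (int k = 0; k < 4; k++) {
        int src = (int)((imm >> (2 * k)) & 3);
        a |= (uint64_t)get_word(b, src) << (16 * k);
    }
    return a;
}

/* Scalar model of PMADDWD: multiply signed words pairwise and add
   adjacent products, giving two 32 bit results. */
uint64_t pmaddwd_model(uint64_t a, uint64_t b)
{
    uint64_t result = 0;
    for (int k = 0; k < 2; k++) {
        int64_t p0 = (int64_t)to_signed16(get_word(a, 2 * k))
                   * to_signed16(get_word(b, 2 * k));
        int64_t p1 = (int64_t)to_signed16(get_word(a, 2 * k + 1))
                   * to_signed16(get_word(b, 2 * k + 1));
        uint32_t sum = (uint32_t)(p0 + p1);  /* keep the low 32 bits */
        result |= (uint64_t)sum << (32 * k);
    }
    return result;
}
```

With this model, an immediate of 0x1B reverses the four words and 0xE4 leaves them in place; pmaddwd_model on words {1,2,3,4} and {5,6,7,8} gives the dwords {17, 53}.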
Material prepared for CSE 3101 by Matt Mahoney, Oct. 30, 2004.