MMX Instructions

MMX is widely supported: Intel Pentium MMX, Pentium 2+, Celeron, AMD K6, Athlon, Duron, and higher ([PENT,MMX] in NASM docs).

Extended MMX instructions are supported by newer processors: Pentium 3+, Celeron 2+, Athlon, and Duron ([KATMAI,MMX] in NASM docs).

Instruction Set

                                                   PADDUSW mm1, mm2
                                                   ^ ^ ^ ^  ^    ^
                                                   | | | |  |    |
   P = packed (all MMX instructions) --------------+ | | |  |    |
   Instruction (add) --------------------------------+ | |  |    |
   S = saturation, US = unsigned saturation -----------+ |  |    |
   B, W, D, Q = 8 bytes, 4 words, 2 dwords, or 1 qword --+  |    |
                                                            |    |
   Destination is an MMX register, mm0-mm7 -----------------+    |
                                                                 |
   Source is an MMX register or 64 bit memory location ----------+

Saturation means that overflows are handled by replacing with the closest representable value, e.g. 0 to 255 for unsigned byte saturation, -32768 to 32767 for signed word, etc.
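The difference between wraparound and saturation can be sketched with a scalar model of a single element. This is plain C rather than MMX code, and the function names are hypothetical, chosen only for illustration; a real PADDB, PADDUSB, or PADDSW applies the same rule to every element in parallel.

```c
#include <stdint.h>

/* Illustrative model of one PADDB element: the carry out of bit 7 is discarded. */
uint8_t add_wrap8(uint8_t a, uint8_t b) {
    return (uint8_t)(a + b);
}

/* Illustrative model of one PADDUSB element: results are clamped to 0..255. */
uint8_t add_usat8(uint8_t a, uint8_t b) {
    unsigned sum = (unsigned)a + b;
    return sum > 255u ? 255u : (uint8_t)sum;
}

/* Illustrative model of one PADDSW element: results are clamped to -32768..32767. */
int16_t add_ssat16(int16_t a, int16_t b) {
    int32_t sum = (int32_t)a + b;   /* 32 bits is wide enough to hold any word sum */
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}
```

For example, add_wrap8(200, 100) gives 44 because 300 wraps modulo 256, while add_usat8(200, 100) gives 255 and add_ssat16(30000, 10000) gives 32767.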

Unless noted, all instructions take two operands, a and b.
a is mm0-mm7, b is mm0-mm7 or a 64 bit memory location.
a[0], b[0] means the low order byte, word, or dword of a or b.
x means B W D or Q indicating 8, 16, 32, or 64 bit size of input elements.
r means eax, ebx, ecx, edx, esi, edi, esp, ebp.
mm means one of mm0-mm7.
mem means memory in any addressing mode (e.g. qword [eax+ebx*8+offset] in NASM syntax).
i means an immediate constant (0, 1, 2...).
* means extended MMX, not supported on some older processors.
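As a rough illustration of the element notation, the following C sketch treats an MMX register as a 64-bit integer and extracts element i counting from the low-order end, so that a[0] is the lowest byte, word, or dword. The helper name lane and its parameters are assumptions made for this example and are not part of any instruction set.

```c
#include <stdint.h>

/* Hypothetical helper: element i of a 64-bit value, counted from the low-order end.
   bits is the element width: 8 for B, 16 for W, 32 for D. */
uint32_t lane(uint64_t reg, unsigned bits, unsigned i) {
    uint64_t mask = (1ULL << bits) - 1;          /* valid for bits < 64 */
    return (uint32_t)((reg >> (bits * i)) & mask);
}
```

With reg = 0x0807060504030201, lane(reg, 8, 0) is 0x01, lane(reg, 16, 3) is 0x0807, and lane(reg, 32, 1) is 0x08070605.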

Instruction       Input Sizes   Notes
-----------       -----------   -----
; Required FPU reset after any MMX instruction before FPU instructions
EMMS                            No operands

; Move
MOVQ                        Q   a = b, a may be memory if b is mm0-7
MOVNTQ mem, mm      *       Q   a = b, fast (non temporal) store
MOVD a, mm                D     a = mm[0], a is a 32 bit register or memory
MOVD mm, b                D     to Q, mm = {b, 0}, b is a 32 bit register or memory

; Parallel arithmetic
PADDx                 B W D     a += b, discard carry
PADDSx                B W       a += b, signed saturation
PADDUSx               B W       a += b, unsigned saturation
PSUBx                 B W D     a -= b, discard borrow
PSUBSx                B W       a -= b, signed saturation
PSUBUSx               B W       a -= b, unsigned saturation
PMULLW                  W       a = (a * b) & 0xffff
PMULHW                  W       a = (a * b) >> 16, signed
PMULHUW             *   W       a = (a * b) >> 16, unsigned

; Complex multiply or vector product
PMADDWD                 W       to D, a = {a[0]*b[0]+a[1]*b[1], a[2]*b[2]+a[3]*b[3]}, signed

; Parallel compare, conditional store
PCMPEQx               B W D     a = -(a == b)
PCMPGTx               B W D     a = -(a > b), signed
PMOVMSKB r, mm      * B         bit i of r = (mm[i] < 0), i=0..7, zero extended
MASKMOVQ            * B         if (b[i] < 0) [edi+i] = a[i], b must be mm0-7

; Logical (element boundaries are irrelevant)
PAND                        Q   a &= b
PANDN                       Q   a = ~a & b
POR                         Q   a |= b
PXOR                        Q   a ^= b

; Shift
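PSLLx                   W D Q   a <<= b, b is a 64 bit count or an immediate, count > 15/31/63 gives 0
PSRLx                   W D Q   a >>= b, unsigned, b is a 64 bit count or an immediate, count > 15/31/63 gives 0
PSRAx                   W D     a >>= b, signed, b is a 64 bit count or an immediate, count > 15/31 fills with the sign bit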

; Pack to smaller type
PACKSSWB                W       to B, a = {a[0]..a[3],b[0]..b[3]}, signed saturation
PACKUSWB                W       to B, a = {a[0]..a[3],b[0]..b[3]}, unsigned saturation
PACKSSDW                  D     to W, a = {a[0], a[1],b[0], b[1]}, signed saturation

; Unpack
PUNPCKLBW             B         a = {a[0],b[0]...a[3],b[3]}
PUNPCKHBW             B         a = {a[4],b[4]...a[7],b[7]}
PUNPCKLWD               W       a = {a[0],b[0],a[1],b[1]}
PUNPCKHWD               W       a = {a[2],b[2],a[3],b[3]}
PUNPCKLDQ                 D     a = {a[0],b[0]}
PUNPCKHDQ                 D     a = {a[1],b[1]}

; Parallel min, max, average
PMINUB              * B         a = min(a, b), unsigned
PMAXUB              * B         a = max(a, b), unsigned
PMINSW              *   W       a = min(a, b), signed
PMAXSW              *   W       a = max(a, b), signed
PAVGx               * B W       a = (a + b + 1) >> 1

; Reorder elements
PSHUFW a, b, i      *   W       a = {b[i&3], b[(i>>2)&3], b[(i>>4)&3], b[(i>>6)&3]}

; Extract single element
PEXTRW r, mm, i     *   W       r = mm[i], i=0..3, zero extend
PINSRW mm, b, i     *   W       mm[i] = b (low 16 bits), i=0..3, b is a 16/32 bit register or 16 bit memory

; Vector sum of absolute differences
PSADBW              * B         to W, a[0] = sum(abs(a-b)), unsigned, zero extended

MOVNTQ and MASKMOVQ are non-temporal stores: they write data that will not be reloaded for a while and so should bypass the cache. This leaves the cache free for data that will be reused.
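Some of the table's element formulas are easier to check against a scalar model. The C sketch below mirrors the Notes column for PMADDWD, PSHUFW, PACKUSWB, and PSADBW, with each register represented as an array whose index 0 is the low-order element. It is an assumption-laden model for study, not MMX code, and the function names simply reuse the mnemonics for readability.

```c
#include <stdint.h>

/* PMADDWD model: multiply signed words pairwise, add adjacent products into dwords. */
void pmaddwd(const int16_t a[4], const int16_t b[4], int32_t out[2]) {
    for (int k = 0; k < 2; k++) {
        /* 64-bit intermediate avoids signed overflow when all inputs are -32768 */
        int64_t sum = (int64_t)a[2 * k] * b[2 * k]
                    + (int64_t)a[2 * k + 1] * b[2 * k + 1];
        out[k] = (int32_t)(uint32_t)sum;   /* keep only the low 32 bits */
    }
}

/* PSHUFW model: each 2-bit field of the immediate selects one source word. */
void pshufw(const uint16_t b[4], uint8_t imm, uint16_t out[4]) {
    for (int k = 0; k < 4; k++)
        out[k] = b[(imm >> (2 * k)) & 3];
}

/* One PACKUSWB element: signed word clamped to the unsigned byte range 0..255. */
uint8_t pack_usat(int16_t w) {
    if (w < 0) return 0;
    if (w > 255) return 255;
    return (uint8_t)w;
}

/* PACKUSWB model: low four bytes come from a, high four bytes from b. */
void packuswb(const int16_t a[4], const int16_t b[4], uint8_t out[8]) {
    for (int k = 0; k < 4; k++) {
        out[k] = pack_usat(a[k]);
        out[k + 4] = pack_usat(b[k]);
    }
}

/* PSADBW model: sum of absolute differences of eight unsigned bytes (at most 2040). */
uint16_t psadbw(const uint8_t a[8], const uint8_t b[8]) {
    unsigned sum = 0;
    for (int k = 0; k < 8; k++)
        sum += a[k] > b[k] ? (unsigned)(a[k] - b[k]) : (unsigned)(b[k] - a[k]);
    return (uint16_t)sum;
}
```

For instance, pshufw with the immediate 0x1B reverses the four words, since its 2-bit fields read 3, 2, 1, 0 from the low end, and pmaddwd on {1, 2, 3, 4} and {5, 6, 7, 8} yields {17, 53}.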

References

MMX Primer
NASM Documentation (starting with P)

Material prepared for CSE 3101 by Matt Mahoney, Oct. 30, 2004.