Microsoft and SSE2 Intrinsics, or: How to Drive a Cryptographer Mad

When you implement cryptographic algorithms, especially fast ones, you often work with the Intel MMX/SSEx/AVX/AVX2 instruction sets. Recently, I lost a bit of time tracking down a rather hard-to-find bug. The moral of the story could be: never, ever ignore compiler warnings!

Here is the context: my implementation was working on Mac OS X and Linux, compiled in both 32-bit and 64-bit modes, with gcc and Clang/LLVM. This code made heavy use of SSE intrinsics. Quite happily, I handed my code to the person responsible for compiling, testing and integrating it into a Microsoft environment, and of course, the code was not working anymore…

The cause was the way I used _mm_slli_epi64(). This intrinsic maps to the PSLLQ machine instruction which, given a 128-bit SSE register, shifts the two 64-bit halves of the register to the left independently of each other, inserting zero bits where required. Basically, this routine takes a 128-bit value and an int representing the shift count, and it returns the resulting doubly-shifted 128-bit value.
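To make the "independent halves" point concrete, here is a minimal sketch. The helper name shift_lanes_by_4 and the sample values are mine, chosen for illustration only; the intrinsics themselves are the standard SSE2 ones from emmintrin.h.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: packs lo and hi into one 128-bit register,
   shifts each 64-bit half left by 4 bits, and stores the low half
   in out[0] and the high half in out[1]. */
static void shift_lanes_by_4(uint64_t lo, uint64_t hi, uint64_t out[2])
{
    __m128i v = _mm_set_epi64x((long long)hi, (long long)lo);
    __m128i r = _mm_slli_epi64(v, 4);      /* PSLLQ with count 4 */
    _mm_storeu_si128((__m128i *)out, r);   /* unaligned store is fine here */
}
```

With lo = 0xF000000000000001 and hi = 0x1, both halves come out as 0x10: the top nibble of the low half is shifted out and discarded, it does not carry into the high half.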

Now, let’s have a look at the following code:

It does two things. First, it loads the 128-bit constant 0xDEADBEEFDEADBEEFDEADBEEFDEADBEEF into a register and shifts each of the two 64-bit halves by … 64 positions. In other words, it is one way (among many) to clear the register; by the way, don’t ask me why I am using this kind of weird mechanism, it’s classified. Second, it outputs the result.
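A test of this kind can be sketched as follows. This is a reconstruction from the description above, not necessarily the exact code I used; the function names shift_deadbeef_by_64 and print_halves are made up for the occasion.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <stdio.h>

/* Loads 0xDEADBEEF repeated four times into an SSE register, shifts
   both 64-bit halves left by 64, and writes the two halves to out[]. */
static void shift_deadbeef_by_64(uint64_t out[2])
{
    __m128i v = _mm_set1_epi32((int)0xDEADBEEFu);
    __m128i r = _mm_slli_epi64(v, 64);     /* the call under discussion */
    _mm_storeu_si128((__m128i *)out, r);
}

/* Prints the 128-bit value as 32 hex digits, high half first. */
static void print_halves(const uint64_t v[2])
{
    printf("%016llx%016llx\n",
           (unsigned long long)v[1], (unsigned long long)v[0]);
}
```

Calling shift_deadbeef_by_64() and then print_halves() from main is enough to compare compilers. With gcc or Clang, I would expect both halves to be zero.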

Here is the result on a *nix system, when compiled with gcc:

Now, let’s compile the same thing with Microsoft Visual Studio 10. The result is now:

Obviously, the Microsoft compiler has replaced the constant 64 with 0. Actually, it is even polite enough to mention it in a warning:

Disassembling the resulting binary confirms this behavior:

So, who’s right?! A glance at Intel’s manuals tells us the following:

Note that the shift count is an imm8 value, i.e., it is encoded on 8 bits. This means that the shift count can be as large as 255, which clearly contradicts the explanations given in the Microsoft Visual Studio 10 documentation. Further, one sees how this instruction is supposed to behave:
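In plain C, the behavior Intel specifies for one 64-bit half can be modeled roughly like this. It is a scalar sketch of the documented semantics (a count above 63 yields all zeros), not Intel's own pseudocode, and psllq_lane_model is a name I made up.

```c
#include <stdint.h>

/* Scalar model of PSLLQ on a single 64-bit half with an 8-bit
   immediate count: any count above 63 clears the half. */
static uint64_t psllq_lane_model(uint64_t lane, uint8_t count)
{
    if (count > 63) {
        return 0;            /* all bits shifted out */
    }
    return lane << count;    /* well-defined in C, since count <= 63 */
}
```

Under this model, a count of 64 (or 255, for that matter) applied to 0xDEADBEEFDEADBEEF gives 0, which is what gcc produced.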

Clearly, the way Microsoft’s compiler translates this intrinsic does not match the Intel specification (I can hardly wait for people to explain to me why it’s a feature…).

Now, one can wonder why the Microsoft developers made this choice. A possible explanation is the following: for the logical left shift instruction (SHL) operating on general-purpose registers, the hardware does apply a mask to the shift count (5 bits for operands of up to 32 bits, 6 bits for 64-bit operands). However, this is not the case for the SSE shift instructions.
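For comparison, here is the same kind of scalar sketch for a masked shift, as SHL performs it on a 64-bit general-purpose register. Again, shl64_masked_model is a hypothetical name, and this only illustrates my guess about where the compiler's behavior comes from.

```c
#include <stdint.h>

/* Scalar model of SHL on a 64-bit register: only the low 6 bits of
   the count are used, so a count of 64 behaves like a count of 0. */
static uint64_t shl64_masked_model(uint64_t value, unsigned count)
{
    return value << (count & 63u);
}
```

With a count of 64, this model leaves 0xDEADBEEFDEADBEEF untouched, which matches replacing the constant 64 with 0.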