Microsoft and SSE2 Intrinsics, or: How to Drive a Cryptographer Mad

Jul 05 2012

When you implement cryptographic algorithms, especially fast implementations, you often work with the Intel MMX/SSEx/AVX/AVX2 instruction sets. Recently, I lost a bit of time tracking down a rather difficult-to-find bug. The moral of the story could be: never, ever neglect compiler warnings!

Here is the context: my implementation was working on Mac OS X and Linux, when compiled in 32-bit and 64-bit modes, with gcc and Clang/LLVM. This code made heavy use of SSE intrinsics. Quite happily, I handed my code to the person responsible for compiling, testing and integrating it into a Microsoft environment, and of course, the code was not working anymore…

The cause was the way I used the _mm_slli_epi64() intrinsic. This intrinsic maps to the PSLLQ machine instruction which, given a 128-bit SSE register, shifts the two 64-bit halves of the register to the left independently, inserting zero bits where required. Basically, this routine takes a 128-bit value and an int representing the shift count, and it returns the resulting 128-bit value with both halves shifted.
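To make the lane-wise behaviour concrete, here is a minimal sketch using the real SSE2 intrinsics from `<emmintrin.h>`. The helper name `shift_lanes_by_4` and the choice of a 4-bit shift are assumptions made for illustration only; they do not come from my actual code.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: shift both 64-bit lanes left by 4 bits.
   out[0] receives the low lane, out[1] the high lane. */
void shift_lanes_by_4(uint64_t lo, uint64_t hi, uint64_t out[2])
{
    const uint64_t in[2] = { lo, hi };

    /* Unaligned load of the two lanes into one 128-bit register. */
    __m128i v = _mm_loadu_si128((const __m128i *)in);

    /* PSLLQ: each 64-bit lane is shifted on its own, zeros come in
       from the right, and bits leaving a lane are simply dropped. */
    __m128i r = _mm_slli_epi64(v, 4);

    _mm_storeu_si128((__m128i *)out, r);
}
```

Calling it with a low lane whose top bits are set shows that nothing spills over into the high lane: the bits shifted out of the low lane are lost.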

Now, let’s have a look at the following code:

It does two things. First, it loads the 128-bit constant 0xDEADBEEFDEADBEEFDEADBEEFDEADBEEF into a register and shifts the two 64-bit parts by … 64 positions. In other words, it’s a way (among many others) to clear the register; by the way, don’t ask me why I am using this kind of weird mechanism, it’s classified. Second, it outputs the result.
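A test of this kind can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the exact listing: the helper names `shift_deadbeef_by_64` and `print_lanes` are hypothetical, and only the constant and the shift count of 64 are taken from the description above.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: load 0xDEADBEEF...DEADBEEF, shift both 64-bit
   lanes left by 64, and store the two lanes into out[]. */
void shift_deadbeef_by_64(uint64_t out[2])
{
    const uint64_t in[2] = { 0xDEADBEEFDEADBEEFULL, 0xDEADBEEFDEADBEEFULL };

    __m128i v = _mm_loadu_si128((const __m128i *)in);

    /* The shift count equals the lane width; per the Intel description
       of PSLLQ this should clear both lanes. */
    v = _mm_slli_epi64(v, 64);

    _mm_storeu_si128((__m128i *)out, v);
}

/* Hypothetical helper: print the register as 32 hex digits, high lane first. */
void print_lanes(const uint64_t lanes[2])
{
    printf("%016llx%016llx\n",
           (unsigned long long)lanes[1],
           (unsigned long long)lanes[0]);
}
```

A `main` would simply call `shift_deadbeef_by_64` on a two-element array and pass it to `print_lanes`. Watch the compiler output closely when building something like this: a warning about the shift count is exactly the kind of hint that should not be ignored.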

Here is the result on a *nix system, when compiled with gcc:

Now, let’s compile the same thing with Microsoft Visual Studio 10. The result is now:

Obviously, the Microsoft compiler has turned the constant 64 into 0. Actually, it is even polite enough to mention it in a warning:

Disassembling the result confirms this behaviour:

So, who’s right?! Glancing into Intel’s manuals, one learns the following:

Note that the shift count is an imm8 value, i.e., it is encoded on 8 bits. This means that the shift count can be as large as 255, which clearly contradicts the explanations given in the Microsoft Visual Studio 10 documentation. Further, one sees how this instruction is supposed to be implemented:

Clearly, the way Microsoft’s compiler translates this intrinsic does not match the Intel specifications (I can hardly wait for people to explain to me why it’s a feature…).
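In plain C, the behaviour Intel documents for PSLLQ on a single 64-bit lane can be modelled roughly as below. This is only a scalar sketch of the documented semantics (a count greater than 63 sets the destination to all zeros); the function name `psllq_lane_model` is an assumption of mine.

```c
#include <stdint.h>

/* Scalar model of PSLLQ on one 64-bit lane: the count is NOT masked.
   Any count above 63 yields zero; otherwise it is an ordinary left
   shift with zeros inserted on the right. */
uint64_t psllq_lane_model(uint64_t lane, unsigned count)
{
    if (count > 63) {
        return 0;
    }
    return lane << count;
}
```

With this model, a count of 64 clears the lane, whereas a compiler that silently reduces 64 to 0 leaves the lane unchanged.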

Now, one can wonder why the Microsoft developers made this choice. A possible explanation is the following: for the logical left shift instruction (SHL) operating on general-purpose registers, the hardware does apply a mask to the shift count. However, this is not the case for the SSE shift instructions.
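For comparison, the masking that the hardware applies for SHL on general-purpose registers can be sketched like this: the count is reduced to 5 bits for 32-bit operands and to 6 bits for 64-bit operands. The function names are hypothetical, and the code models the instruction rather than the C shift operator, for which a count greater than or equal to the operand width is undefined behaviour.

```c
#include <stdint.h>

/* Scalar model of SHL on a 32-bit register: only the low 5 bits of
   the count are used, so a count of 32 behaves like a count of 0. */
uint32_t shl32_model(uint32_t value, unsigned count)
{
    return value << (count & 31u);
}

/* Scalar model of SHL on a 64-bit register: only the low 6 bits of
   the count are used, so a count of 64 behaves like a count of 0. */
uint64_t shl64_model(uint64_t value, unsigned count)
{
    return value << (count & 63u);
}
```

Under this model, shifting a 64-bit value by 64 returns the value unchanged, which is precisely the behaviour that is wrong for PSLLQ.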


3 responses so far

  1. What VC likely is doing is applying the rule (for scalar types) that a shift of amount equal to or greater than the size of the value results in an undefined value to SIMD types. I’ve seen Clang do something similar for signed overflow in SSE2. I think part of the reason this happens is these days many compilers are not treating the intrinsics as black boxes of “see intrinsic X, emit instruction X” but instead modeling __m128i and so on as vectors of scalars and then performing optimizations on them as if they were scalar values, including taking advantage of things that the C/C++ standards say are undefined like large shifts, then emitting SSEx instructions in codegen. Take a look at Clang’s emmintrin.h sometime, most intrinsics are just using the generic vector notation with the assumption that the compiler will generate the right SSE ops from that.

    Given that, I’m actually a bit surprised that Clang does seem to zero the register with _mm_slli_epi64(…, 64), as with Clang 3.1:

    unsigned int x = 0xDEADBEEF;
    x >>= 32;

    results in x == 0xDEADBEEF.

  2. what I was going to say:
    can you give the actual bytes emitted by gcc/clang/msvc?
    perhaps there’s another intrinsic or way to force msvc to emit the correct version of psllq?

    however, after looking at the manual, there is no version of psllq which would be truncating the immediate in order to fit in the instruction encoding. You should file a bug with msvc.

  3. @Jack: The fact that Clang truncates the shift count for unsigned integers is totally right, as this will be translated to a SHR instruction, which does implement the truncation (see my last screenshot of the Intel manuals). In summary, Clang perfectly sticks to the Intel specs: truncate for SxR, do not truncate for PSLLQ.

    @Shawn: I have invested some time to find a way to communicate this to Microsoft, but it seems rather complicated :-/ I guess a direct contact would help.
