Post by Bonita Montero
Post by Marcel Mueller
Post by Bonita Montero
bit_cast<double>( (uint64_t)r << 63 | 0x3FFull << 52 )
-((int)r&1) | 1
Marcel
movabsq $4607182418800017408, %rax
salq $63, %rdi
orq %rax, %rdi
movq %rdi, %xmm0

andl $1, %edi
pxor %xmm0, %xmm0
negl %edi
orl $1, %edi
cvtsi2sdl %edi, %xmm0
And I've seen that cvtsi2sdl has a six-cycle latency on my
Zen 4 CPU, whereas movq %rdi, %xmm0 takes only one clock cycle.
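For anyone who wants to check what the two quoted expressions compute before arguing about their speed, here is a minimal sketch. The function names (bits_to_double, sign_bitcast, sign_int) are made up for illustration and are not from the thread. It assumes r is a bool and that double is a 64-bit IEEE-754 value. std::memcpy is used as a pre-C++20 stand-in for bit_cast so that it builds with older compilers.

```cpp
#include <cstdint>
#include <cstring>

// Pre-C++20 stand-in for std::bit_cast<double>: copies the 64 bits unchanged.
// ASSUMPTION: double is a 64-bit IEEE-754 value.
inline double bits_to_double(std::uint64_t bits) {
    static_assert(sizeof(double) == sizeof(std::uint64_t), "assumes a 64-bit double");
    double d;
    std::memcpy(&d, &bits, sizeof d);
    return d;
}

// First quoted expression: build the bit pattern directly.
// 0x3FF << 52 is the bit pattern of 1.0, and bit 63 is the sign bit.
inline double sign_bitcast(bool r) {
    return bits_to_double((static_cast<std::uint64_t>(r) << 63) | (0x3FFull << 52));
}

// Second quoted expression: compute the integer -1 or +1,
// then let the implicit int-to-double conversion do the rest.
inline double sign_int(bool r) {
    return -(static_cast<int>(r) & 1) | 1;
}
```

Under those assumptions both functions give 1.0 for false and -1.0 for true, so the only open question is how fast each one is in context.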
That's the kind of thing that is at least vaguely relevant here. The
number of instructions means little (especially when it is not clear
that your code involves fewer bytes) - the time taken for these
instructions is the important thing for claims of "most performant" code.
But even with that, all you've got is a claim that one obscure
expression might be slightly faster than some other obscure expression,
on some particular machine. No one cares if a program runs a few
nanoseconds faster - certainly not enough to accept a re-write like
yours (or Marcel's).
What you need to do is find some reason for wanting the original
expression, where you need to evaluate it vast numbers of times.
Perhaps use it in a big vector calculation or DSP algorithm. Write the
full code out. Time it on a real machine, first with the original
expression (from the subject line), then with your version, then with
Marcel's. Compile with a good optimising compiler (such as gcc or clang,
not MSVC) with vectorisation appropriate for the processor you are using
(use "-march=native"). Time the differences.
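A rough sketch of what such a timing harness could look like follows. Several things here are assumptions and not from the thread: the subject-line expression is not quoted in this post, so sign_plain uses r ? -1.0 : 1.0 as a stand-in; the kernel (multiply each sample by the sign and accumulate) is just one plausible DSP-style use; all names (Timing, time_kernel, sign_*) are invented; and std::memcpy again stands in for bit_cast.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// ASSUMPTION: stand-in for the original subject-line expression.
inline double sign_plain(bool r) { return r ? -1.0 : 1.0; }

// Bit-pattern version; memcpy is a pre-C++20 stand-in for std::bit_cast.
inline double sign_bitcast(bool r) {
    std::uint64_t bits = (static_cast<std::uint64_t>(r) << 63) | (0x3FFull << 52);
    double d;
    std::memcpy(&d, &bits, sizeof d);
    return d;
}

// Integer version, converted to double on return.
inline double sign_int(bool r) { return -(static_cast<int>(r) & 1) | 1; }

struct Timing {
    double sum;          // result of the kernel, also keeps the loop from being discarded
    double ns_per_elem;  // average wall-clock time per element
};

// Hypothetical kernel: flip the sign of x[i] according to flags[i] and accumulate.
// Requires flags.size() <= x.size().
template <class F>
Timing time_kernel(F f, const std::vector<unsigned char>& flags,
                   const std::vector<double>& x, int repeats = 100) {
    using clock = std::chrono::steady_clock;
    double sum = 0.0;
    const auto t0 = clock::now();
    for (int k = 0; k < repeats; ++k)
        for (std::size_t i = 0; i < flags.size(); ++i)
            sum += f(flags[i] != 0) * x[i];
    const auto t1 = clock::now();
    const double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    return {sum, ns / (static_cast<double>(repeats) * static_cast<double>(flags.size()))};
}
```

Call time_kernel once per variant from main with a few million elements, check that the three sums agree, and compare ns_per_elem. Passing lambdas instead of function names gives the compiler a better chance to inline and vectorise each variant. A single run proves little; repeat it and look at the spread before drawing conclusions.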
My guess is that in real code, the difference will be negligible, but
that the original simple and clear expression has the edge because it
lets the compiler do more optimisations.
Feel free to prove my guess wrong - it is, after all, just a guess.