Discussion:
Most performant way to have "r % 2 ? -1.0 : 1.0"
Bonita Montero
2024-04-30 09:34:45 UTC
Reply
Permalink
bit_cast<double>( (uint64_t)r << 63 | 0x3FFull << 52 )

This is more performant until AVX10 comes, which has conditional moves
for floating point values.
Marcel Mueller
2024-05-01 14:11:14 UTC
Reply
Permalink
Post by Bonita Montero
bit_cast<double>( (uint64_t)r << 63 | 0x3FFull << 52 )
There is no need for conditional:

-((int)r&1) | 1


Marcel
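For reference, the integer form as a compilable sketch (wrapper name mine; note it yields an int that still has to be converted to double at the point of use):

```cpp
#include <cassert>

// -(r & 1) is 0 for even r and -1 (all bits set) for odd r;
// OR-ing with 1 maps 0 -> 1 and leaves -1 unchanged.
int parity_sign_int(int r) {
    return -(r & 1) | 1;
}
```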
Bonita Montero
2024-05-01 15:05:42 UTC
Reply
Permalink
Post by Marcel Mueller
Post by Bonita Montero
bit_cast<double>( (uint64_t)r << 63 | 0x3FFull << 52 )
-((int)r&1) | 1
Marcel
Looks simpler, but your code is one instruction longer:

my code:
movabsq $4607182418800017408, %rax
salq $63, %rdi
orq %rax, %rdi
movq %rdi, %xmm0

your code:
andl $1, %edi
pxor %xmm0, %xmm0
negl %edi
orl $1, %edi
cvtsi2sdl %edi, %xmm0
Bonita Montero
2024-05-01 15:14:27 UTC
Reply
Permalink
Post by Marcel Mueller
Post by Bonita Montero
bit_cast<double>( (uint64_t)r << 63 | 0x3FFull << 52 )
-((int)r&1) | 1
Marcel
        movabsq $4607182418800017408, %rax
        salq    $63, %rdi
        orq     %rax, %rdi
        movq    %rdi, %xmm0
        andl    $1, %edi
        pxor    %xmm0, %xmm0
        negl    %edi
        orl     $1, %edi
        cvtsi2sdl       %edi, %xmm0
And I've seen that cvtsi2sdl has a six-cycle latency on my
Zen4-CPU whereas the movq %rdi, %xmm0 only takes one clock cycle.
David Brown
2024-05-01 16:46:31 UTC
Reply
Permalink
Post by Bonita Montero
Post by Marcel Mueller
Post by Bonita Montero
bit_cast<double>( (uint64_t)r << 63 | 0x3FFull << 52 )
-((int)r&1) | 1
Marcel
         movabsq $4607182418800017408, %rax
         salq    $63, %rdi
         orq     %rax, %rdi
         movq    %rdi, %xmm0
         andl    $1, %edi
         pxor    %xmm0, %xmm0
         negl    %edi
         orl     $1, %edi
         cvtsi2sdl       %edi, %xmm0
And I've seen that cvtsi2sdl has a six-cycle latency on my
Zen4-CPU whereas the movq %rdi, %xmm0 only takes one clock cycle.
That's the kind of thing that is at least vaguely relevant here. The
number of instructions means little (especially when it is not clear
that your code involves fewer bytes) - the time taken for these
instructions is the important thing for claims of "most performant" code.

But even with that, all you've got is a claim that one obscure
expression might be slightly faster than some other obscure expression,
on some particular machine. No one cares if a program runs a few
nanoseconds faster - certainly not enough to accept a re-write like
yours (or Marcel's).

What you need to do is find some reason for wanting the original
expression, where you need to evaluate it vast numbers of times.
Perhaps use it in a big vector calculation or DSP algorithm. Write the
full code out. Time it on a real machine, first with the original
expression (from the subject line), then with your version, then with
Marcel's. Compile with a good optimising compiler (such as gcc or clang,
not MSVC) with vectorisation appropriate for the processor you are using
(use "-march=native"). Time the differences.

My guess is that in real code, the difference will be negligible, but
that the original simple and clear expression has the edge because it
lets the compiler do more optimisations.

Feel free to prove my guess wrong - it is, after all, just a guess.
Bonita Montero
2024-05-01 17:12:42 UTC
Reply
Permalink
Post by David Brown
That's the kind of thing that is at least vaguely relevant here.  The
number of instructions means little (especially when it is not clear
that your code involves fewer bytes) - the time taken for these
instructions is the important thing for claims of "most performant" code.
Every instruction for my code has a single cycle latency. Marcel's code
has single cycle latencies except for the last instruction, which takes
seven cycles on my Zen4-CPU. Take the numbers from agner.org, they're
similar for all modern x86-incarnations.
Post by David Brown
But even with that, all you've got is a claim that one obscure
expression might be slightly faster than some other obscure expression,
Optimized code is often less readable.
Post by David Brown
What you need to do is find some reason for wanting the original
expression, where you need to evaluate it vast numbers of times.
My code takes three clock cycles on all modern x86-CPUs.
David Brown
2024-05-02 14:00:36 UTC
Reply
Permalink
Post by Bonita Montero
Post by David Brown
That's the kind of thing that is at least vaguely relevant here.  The
number of instructions means little (especially when it is not clear
that your code involves fewer bytes) - the time taken for these
instructions is the important thing for claims of "most performant" code.
Every instruction for my code has a single cycle latency.
Surely you know that is not sufficient to justify claims of performance?
Latency is certainly a factor, but so is scheduling, parallel
execution, bandwidth for instruction caches and queues, pipeline hazards
and result forwarding, and a dozen other factors.
Post by Bonita Montero
Marcel's code
has single cycle latencies except for the last instruction, which takes
seven cycles on my Zen4-CPU. Take the numbers from agner.org, they're
similar for all modern x86-incarnations.
Post by David Brown
But even with that, all you've got is a claim that one obscure
expression might be slightly faster than some other obscure expression,
Post by Bonita Montero
Optimized code is often less readable.
That makes it bad code. Sometimes it is acceptable because you need the
very fastest results for a piece of code, but that is very rare. This
is why I am asking you to justify the performance of your code in
practice - justify that it actually /is/ faster, and justify that it is
/usefully/ faster. Without that, it is smart-arse code, not smart code.

(If you are just playing around for fun and the challenge of making such
code sequences, that's fine.)
Post by Bonita Montero
Post by David Brown
What you need to do is find some reason for wanting the original
expression, where you need to evaluate it vast numbers of times.
My code takes three clock cycles on all modern x86-CPUs.
That is almost as far from answering my question as it is possible to get.
Bonita Montero
2024-05-02 14:25:14 UTC
Reply
Permalink
Post by David Brown
Surely you know that is not sufficient to justify claims of performance?
 Latency is certainly a factor, but so is scheduling, parallel
execution, bandwidth for instruction caches and queues, pipeline hazards
and result forwarding, and a dozen other factors.
Latency is the time until there's a result. Fewer instructions make it
more likely that other instructions can occupy free execution units. So
this can be considered in the reduced way I did.
Post by David Brown
That makes it bad code. ...
Absolutely not. A Bubblesort is easier to read than a quicksort but no
one would choose Bubblesort for readability. And for me such tricks are
readable but they overburden you.
David Brown
2024-05-03 09:16:49 UTC
Reply
Permalink
Post by Bonita Montero
Post by David Brown
Surely you know that is not sufficient to justify claims of
performance?   Latency is certainly a factor, but so is scheduling,
parallel execution, bandwidth for instruction caches and queues,
pipeline hazards and result forwarding, and a dozen other factors.
Latency is the time until there's a result. Fewer instructions make it
more likely that other instructions can occupy free execution units. So
this can be considered in the reduced way I did.
I know what "latency" means, and why it is only one of many factors to
consider when looking at performance.
Post by Bonita Montero
Post by David Brown
That makes it bad code. ...
Please stop snipping relevant parts of posts - such as the reason why
your expression is bad code.
Post by Bonita Montero
Absolutely not. A Bubblesort is easier to read than a quicksort but no
one would choose Bubblesort for readability. And for me such tricks are
readable but they overburden you.
/You/ were the one who said your code was less readable. (No one else
had to say it because it is so obvious.)

You posted an expression that is an incomprehensible alternative to a
comprehensible but apparently useless expression. You are unable to
demonstrate that your re-write is actually faster in practice - you can
merely show that compilers generate fewer instructions for it than a
different variation of the original expression. You can't give any uses
of the expression, nor any measurements showing it is better in
practice. Without seeing the effect in context, your code is utterly
pointless and worse than useless.

It's just smart-arse coding - the kind that impresses new programmers
that haven't learned how to write code properly.

If you want to show anyone that your expression is actually a good idea,
you know what you have to do to demonstrate it.
Bonita Montero
2024-05-03 11:35:55 UTC
Reply
Permalink
Post by David Brown
Post by Bonita Montero
Post by David Brown
Surely you know that is not sufficient to justify claims of
performance?   Latency is certainly a factor, but so is scheduling,
parallel execution, bandwidth for instruction caches and queues,
pipeline hazards and result forwarding, and a dozen other factors.
Latency is the time until there's a result. Fewer instructions make it
more likely that other instructions can occupy free execution units. So
this can be considered in the reduced way I did.
I know what "latency" means, and why it is only one of many factors to
consider when looking at performance.
Latency is the time until there's a result, and this matters.
Post by David Brown
Post by Bonita Montero
Post by David Brown
That makes it bad code. ...
Please stop snipping relevant parts of posts - such as the reason why
your expression is bad code.
It's currently the fastest solution on x86 for this purpose.
So this isn't bad code.
Post by David Brown
Post by Bonita Montero
Absolutely not. A Bubblesort is easier to read than a quicksort but no
one would choose Bubblesort for readability. And for me such tricks are
readable but they overburden you.
/You/ were the one who said your code was less readable.  (No one else
had to say it because it is so obvious.)
You posted an expression that is an incomprehensible alternative to a
comprehensible but apparently useless expression.  You are unable to
demonstrate that your re-write is actually faster in practice - you can
merely show that compilers generate fewer instructions for it than a
different variation of the original expression. ...
Compilers don't generate faster code for that. Try it on godbolt.
Post by David Brown
You can't give any uses of the expression, ...
I've one use for that and I extracted the code.
Post by David Brown
It's just smart-arse coding - the kind that impresses new programmers
that haven't learned how to write code properly.
You're overburdened with that style of code and you want to discuss
it away.
Post by David Brown
If you want to show anyone that your expression is actually a good idea,
you know what you have to do to demonstrate it.
David Brown
2024-05-03 12:38:58 UTC
Reply
Permalink
Post by Bonita Montero
Post by David Brown
Post by Bonita Montero
Post by David Brown
Surely you know that is not sufficient to justify claims of
performance?   Latency is certainly a factor, but so is scheduling,
parallel execution, bandwidth for instruction caches and queues,
pipeline hazards and result forwarding, and a dozen other factors.
Latency is the time until there's a result. Fewer instructions make it
more likely that other instructions can occupy free execution units. So
this can be considered in the reduced way I did.
I know what "latency" means, and why it is only one of many factors to
consider when looking at performance.
Latency is the time until there's a result, and this matters.
We agree it matters - your mistake is thinking it is the only thing that
matters.
Post by Bonita Montero
Post by David Brown
Post by Bonita Montero
Post by David Brown
That makes it bad code. ...
Please stop snipping relevant parts of posts - such as the reason why
your expression is bad code.
It's currently the fastest solution on x86 for this purpose.
The only purpose you've demonstrated so far is for your bragging rights
as the biggest smart-arse in the group. (I know English is not your
first language, but to be clear, this is not a compliment.)
Post by Bonita Montero
So this isn't bad code.
Of course it is bad code - it is terrible, and would be rejected by any
code review for normal use. Obscure micro-optimisations like that are
worth considering in two situations - in compilers as code they might
generate for optimisations, and buried deep in very high performance
libraries as part of critical code. You are doing neither.
Post by Bonita Montero
Post by David Brown
Post by Bonita Montero
Absolutely not. A Bubblesort is easier to read than a quicksort but no
one would choose Bubblesort for readability. And for me such tricks are
readable but they overburden you.
/You/ were the one who said your code was less readable.  (No one else
had to say it because it is so obvious.)
You posted an expression that is an incomprehensible alternative to a
comprehensible but apparently useless expression.  You are unable to
demonstrate that your re-write is actually faster in practice - you
can merely show that compilers generate fewer instructions for it than
a different variation of the original expression. ...
Compilers don't generate faster code for that. Try it on godbolt.
Godbolt does not show how fast code is any more than you do.

Do you understand the importance of context in code performance? Do you
understand about the interaction between different parts of code? Or
that compilers are not dumb translators handling single expressions in
isolation? Have you any idea at all about what is /relevant/ in real
programming?

I realise that it is cool and clever to have figured out a trick for
this expression. I've nothing against that - it's fun, and fun is good.
If that's all you are trying to show, then that's fine. But if you
want to claim it is "good" code, or useful, or makes a difference, then
you have to show that. At the moment, what you have is of no use in
programming.
Post by Bonita Montero
Post by David Brown
You can't give any uses of the expression, ...
I've one use for that and I extracted the code.
Then show the rest of the code, along with the timing comparisons I
suggested.
Post by Bonita Montero
Post by David Brown
It's just smart-arse coding - the kind that impresses new programmers
that haven't learned how to write code properly.
You're overburdened with that style of code and you want to discuss
it away.
Post by David Brown
If you want to show anyone that your expression is actually a good
idea, you know what you have to do to demonstrate it.
Bonita Montero
2024-05-03 12:51:56 UTC
Reply
Permalink
Post by David Brown
We agree it matters - your mistake is thinking it is the only thing that
matters.
Of course it is bad code - it is terrible, and would be rejected by any
code review for normal use. ...
Look at the glibc math code; it's doing such tricks all over the lib.
Post by David Brown
Godbolt does not show how fast code is any more than you do.
This can be easily estimated if you know the processing time of the
instructions.
Post by David Brown
Do you understand the importance of context in code performance?  Do you
understand about the interaction between different parts of code?  Or
that compilers are not dumb translators handling single expressions in
isolation?  Have you any idea at all about what is /relevant/ in real
programming?
My use case was some code that alternately multiplies with 1.0 and
-1.0 depending on an index i, and performance was crucial there. For
me such tricks are easy to read, and they don't overburden me like
they do you.
Post by David Brown
Then show the rest of the code, along with the timing comparisons I
suggested.
That's the current version of the statement:
chain[i] = bit_cast<double>( bit_cast<uint64_t>( poly * reverseFacs[i] ) ^ (uint64_t)(i % 2) << 63 );
I even dropped the multiplication by alternately inverting the
sign bit. According to my measurements this makes the whole
routine about 11% faster than the solution before.
David Brown
2024-05-03 15:16:59 UTC
Reply
Permalink
Post by Bonita Montero
Post by David Brown
We agree it matters - your mistake is thinking it is the only thing
that matters.
Of course it is bad code - it is terrible, and would be rejected by
any code review for normal use. ...
Look at the glibc math code; it's doing such tricks all over the lib.
There is not much point discussing anything with you until you learn the
basics of communication.
Bonita Montero
2024-05-03 15:21:44 UTC
Reply
Permalink
Post by David Brown
Post by Bonita Montero
Post by David Brown
We agree it matters - your mistake is thinking it is the only thing
that matters.
Of course it is bad code - it is terrible, and would be rejected by
any code review for normal use. ...
Look at the glibc math code; it's doing such tricks all over the lib.
There is not much point discussing anything with you until you learn the
basics of communication.
You said that teams wouldn't choose a style which includes such
tricks, and I gave you an example of a very popular open source project.
Have a look at the current implementation of fmod() in the glibc, which
does a lot of such bit-fiddling.
Bonita Montero
2024-05-03 15:36:28 UTC
Reply
Permalink
Post by Bonita Montero
Post by David Brown
Post by Bonita Montero
Post by David Brown
We agree it matters - your mistake is thinking it is the only thing
that matters.
Of course it is bad code - it is terrible, and would be rejected by
any code review for normal use. ...
Look at the glibc math code; it's doing such tricks all over the lib.
There is not much point discussing anything with you until you learn
the basics of communication.
You said that teams wouldn't choose a style which includes such
tricks, and I gave you an example of a very popular open source project.
Have a look at the current implementation of fmod() in the glibc, which
does a lot of such bit-fiddling.
Here's the code:
https://github.com/bminor/glibc/blob/master/sysdeps/ieee754/dbl-64/e_fmod.c
red floyd
2024-05-02 01:23:15 UTC
Reply
Permalink
Post by Marcel Mueller
Post by Bonita Montero
bit_cast<double>( (uint64_t)r << 63 | 0x3FFull << 52 )
-((int)r&1) | 1
[redacted]
What the heck, I'll throw this in:

-1 + ((r & 1) << 1)
red floyd
2024-05-02 01:25:19 UTC
Reply
Permalink
Post by red floyd
Post by Marcel Mueller
Post by Bonita Montero
bit_cast<double>( (uint64_t)r << 63 | 0x3FFull << 52 )
-((int)r&1) | 1
[redacted]
-1 + ((r & 1) << 1)
Oops. Sign reversal.

1 - ((r & 1) << 1)
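As a compilable sketch (wrapper name mine):

```cpp
#include <cassert>

// (r & 1) << 1 is 0 for even r and 2 for odd r,
// so 1 minus that gives 1 or -1 respectively.
int parity_sign_arith(int r) {
    return 1 - ((r & 1) << 1);
}
```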
Bonita Montero
2024-05-02 03:49:19 UTC
Reply
Permalink
Post by red floyd
Post by Marcel Mueller
Post by Bonita Montero
bit_cast<double>( (uint64_t)r << 63 | 0x3FFull << 52 )
-((int)r&1) | 1
[redacted]
-1 + ((r & 1) << 1)
The problem with that is that the result is converted to a floating
point value which is a rather slow operation.
David Brown
2024-05-02 14:04:37 UTC
Reply
Permalink
Post by Bonita Montero
Post by red floyd
Post by Marcel Mueller
Post by Bonita Montero
bit_cast<double>( (uint64_t)r << 63 | 0x3FFull << 52 )
-((int)r&1) | 1
[redacted]
-1 + ((r & 1) << 1)
The problem with that is that the result is converted to a floating
point value which is a rather slow operation.
That's only a problem if it is a problem in real code. You haven't
shown any situation where it would be an issue.

In common real-world situations where you have a variable that is set to
-1 or 1 (integer or floating point), the next thing you will do is use
that to multiply other data which you will then accumulate. If the
-1/+1 expression is written simply, the compiler may optimise it away
entirely - turning things into a conditional code that either adds or
subtracts the later data. With your mess, it won't do that - it will
have to use real multiplication and miss out on optimisation opportunities.
Bonita Montero
2024-05-02 14:27:22 UTC
Reply
Permalink
Post by David Brown
That's only a problem if it is a problem in real code.
You haven't shown any situation where it would be an issue.
There are situations where this is imaginable; this is sufficient.
Post by David Brown
In common real-world situations where you have a variable that is set to
-1 or 1 (integer or floating point), the next thing you will do is use
that to multiply other data which you will then accumulate.  If the
-1/+1 expression is written simply, the compiler may optimise it away
entirely - turning things into a conditional code that either adds or
subtracts the later data.
For floating point operations this is likely to induce a jump which
isn't easy to predict, as there is no floating point conditional move
on x86. The upcoming AVX10.2 will have conditional moves for that.
Marcel Mueller
2024-05-03 05:32:20 UTC
Reply
Permalink
Post by Bonita Montero
And I've seen that cvtsi2sdl has a six-cycle latency on my
Zen4-CPU whereas the movq %rdi, %xmm0 only takes one clock cycle.
You are right, int to float conversions are not that optimized. And they
require a dynamic shift operation for the integral logarithm of base 2.

So the only optimization you can do is r&1 ? -1. : 1. to avoid the
division. The compiler may optimize this to a conditional store.


Marcel
Bonita Montero
2024-05-03 06:37:55 UTC
Reply
Permalink
Post by Marcel Mueller
So the only optimization you can do is r&1 ? -1. : 1. to avoid the
division. The compiler may optimize this to a conditional store.
There are currently no conditional moves for floating point values
with x86. clang does a table lookup for the above statement which
is also quite o.k., but a load from L1-cache is four cycles.
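Such a two-entry table can be written by hand roughly like this (a sketch of the lookup approach; the name is mine):

```cpp
#include <cassert>

// Two-element lookup table indexed by the parity bit:
// even r -> 1.0, odd r -> -1.0. Costs an L1 load instead of ALU work.
double parity_sign_table(unsigned r) {
    static const double tab[2] = { 1.0, -1.0 };
    return tab[r & 1];
}
```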
Marcel Mueller
2024-05-03 14:46:28 UTC
Reply
Permalink
Post by Bonita Montero
Post by Marcel Mueller
So the only optimization you can do is r&1 ? -1. : 1. to avoid the
division. The compiler may optimize this to a conditional store.
There are currently no conditional moves for floating point values
with x86.
Interesting. The GPU of a Raspberry Pi Model 1 can do the job. :-)


Marcel
Scott Lurndal
2024-05-01 15:39:49 UTC
Reply
Permalink
Post by Bonita Montero
Post by Marcel Mueller
Post by Bonita Montero
bit_cast<double>( (uint64_t)r << 63 | 0x3FFull << 52 )
-((int)r&1) | 1
Marcel
movabsq $4607182418800017408, %rax
salq $63, %rdi
orq %rax, %rdi
movq %rdi, %xmm0
andl $1, %edi
pxor %xmm0, %xmm0
negl %edi
orl $1, %edi
cvtsi2sdl %edi, %xmm0
the pxor is "executed" in the fetch stage, so adds no latency
(it is effectively a NOP).
Bonita Montero
2024-05-01 15:48:21 UTC
Reply
Permalink
Post by Scott Lurndal
Post by Bonita Montero
Post by Marcel Mueller
Post by Bonita Montero
bit_cast<double>( (uint64_t)r << 63 | 0x3FFull << 52 )
-((int)r&1) | 1
Marcel
movabsq $4607182418800017408, %rax
salq $63, %rdi
orq %rax, %rdi
movq %rdi, %xmm0
andl $1, %edi
pxor %xmm0, %xmm0
negl %edi
orl $1, %edi
cvtsi2sdl %edi, %xmm0
the pxor is "executed" in the fetch stage, so adds no latency
(it is effectively a NOP).
According to Agner's instruction tables it has a one-cycle
latency, and there could be four PXOR instructions executed
at the same time on my Zen4-CPU.