Discussion:
C++20 futex with heavy contention slower than mutex
(too old to reply)
Bonita Montero
2024-03-29 13:14:11 UTC
Permalink
The following program simulates constant locking und unlocking of one
to jthread::hardware_concurrency() threadas with a std::mutex and a
futex. On my 16 core / 32 thread Zen4 system a futex is faster up to
5 threads constantly contending, but beyond the CPU time of the futex
explodes and the conventional mutex is faster with Windows as and with
Linux.

#include <iostream>
#include <thread>
#include <mutex>
#include <atomic>
#include <functional>
#include <chrono>
#include <vector>

using namespace std;
using namespace chrono;

int main()
{
constexpr int64_t ROUNDS = 10000;
auto bench = [&]( char const *head, auto fn )
{
cout << head << endl;
atomic_int64_t tSum( 0 );
vector<jthread> threads;
int hc = jthread::hardware_concurrency();
for( int nThreads = 1; nThreads <= hc; ++nThreads )
{
for( unsigned t = nThreads; t; --t )
threads.emplace_back( [&]()
{
auto start = high_resolution_clock::now();
int64_t cmp = 0;
for( int64_t r = ROUNDS; r; --r )
cmp = fn( cmp );
tSum += duration_cast<nanoseconds>(
high_resolution_clock::now() - start ).count();
} );
threads.resize( 0 );
cout << "\t" << nThreads << ": " << tSum /
((double)nThreads * ROUNDS) << endl;
}
};
{
mutex mtx;
bench( "mutex: ",
[&]( int64_t )
{
mtx.lock();
mtx.unlock();
return 0;
} );
}
{
atomic_int64_t futex( 0 );
constexpr int64_t HIBIT = numeric_limits<int64_t>::min();
bench( "futex:",
[&]( int64_t cmp )
{
int64_t niu;
for( ; ; )
if( cmp >= 0 )
if( futex.compare_exchange_weak( cmp, niu = cmp
| HIBIT, memory_order_acquire, memory_order_relaxed ) )
{
cmp = niu;
break;
}
else;
else
if( futex.compare_exchange_weak( cmp, niu = cmp
+ 1, memory_order_relaxed, memory_order_relaxed ) )
{
futex.wait( niu, memory_order_acquire );
cmp = futex.load( memory_order_relaxed );
}
for( ; ; )
if( cmp & ~HIBIT )
if( futex.compare_exchange_weak( cmp, niu =
(cmp - 1) & ~HIBIT, memory_order_release, memory_order_relaxed ) )
{
cmp = niu;
futex.notify_one();
break;
}
else;
else
if( futex.compare_exchange_weak( cmp, 0,
memory_order_release, memory_order_relaxed ) )
{
cmp = 0;
break;
}
return cmp;
} );
}
}

This are the results under Ubuntu 20.04 LTS:

mutex:
1: 3.9134
2: 5.8706
3: 33.1753
4: 48.63
5: 71.6777
6: 109.556
7: 151.271
8: 208.871
9: 288.031
10: 353.536
11: 446.28
12: 563.841
13: 701.134
14: 833.08
15: 983.734
16: 1138.48
17: 1297.53
18: 1481.15
19: 1664.51
20: 1857.88
21: 2053.83
22: 2285.07
23: 2546.67
24: 2782.53
25: 3065.25
26: 3349.4
27: 3652.06
28: 3971.25
29: 4339.13
30: 4727.31
31: 5129.95
32: 5499.05
futex:
1: 3.3654
2: 5.02155
3: 6.69107
4: 15.2522
5: 16.8824
6: 185.235
7: 162.176
8: 360.421
9: 2272.83
10: 4805.11
11: 8987.24
12: 14225.5
13: 20076.4
14: 27199.4
15: 36711.8
16: 49480.8
17: 57641.8
18: 75633.2
19: 97196.6
20: 122223
21: 147617
22: 180377
23: 216084
24: 256032
25: 299681
26: 345518
27: 398785
28: 445499
29: 506351
30: 572440
31: 643455
32: 721461

With all threads locking and unlocking the conventional mutex is 131
timea faster.
Chris M. Thomasson
2024-03-29 22:28:16 UTC
Permalink
Post by Bonita Montero
The following program simulates constant locking und unlocking of one
to jthread::hardware_concurrency() threadas with a std::mutex and a
futex. On my 16 core / 32 thread Zen4 system a futex is faster up to
5 threads constantly contending, but beyond the CPU time of the futex
explodes and the conventional mutex is faster with Windows as and with
Linux.
[...]

A futex is not a mutex!
Chris M. Thomasson
2024-03-29 22:30:48 UTC
Permalink
Post by Chris M. Thomasson
Post by Bonita Montero
The following program simulates constant locking und unlocking of one
to jthread::hardware_concurrency() threadas with a std::mutex and a
futex. On my 16 core / 32 thread Zen4 system a futex is faster up to
5 threads constantly contending, but beyond the CPU time of the futex
explodes and the conventional mutex is faster with Windows as and with
Linux.
[...]
A futex is not a mutex!
I have to take a look at the logic you used for your "mutex" based on a
futex, not enough time right now. Perhaps, the std::mutex is just way
more efficient that your use of a futex? Keep in mind futexs are tricky:

https://cis.temple.edu/~giorgio/cis307/readings/futex.pdf

;^)

It depends on the logic you use.
Chris M. Thomasson
2024-03-29 23:29:43 UTC
Permalink
Post by Chris M. Thomasson
Post by Chris M. Thomasson
Post by Bonita Montero
The following program simulates constant locking und unlocking of one
to jthread::hardware_concurrency() threadas with a std::mutex and a
futex. On my 16 core / 32 thread Zen4 system a futex is faster up to
5 threads constantly contending, but beyond the CPU time of the futex
explodes and the conventional mutex is faster with Windows as and with
Linux.
[...]
A futex is not a mutex!
I have to take a look at the logic you used for your "mutex" based on a
futex, not enough time right now. Perhaps, the std::mutex is just way
https://cis.temple.edu/~giorgio/cis307/readings/futex.pdf
;^)
It depends on the logic you use.
You generally want a "waitbit" type of logic... I posted a mutex based
on a futex here a while back. I just need to find it. Iirc, it used a
Windows "futex"... If I can find it, you should test against it. It used
NO compare and swap. Ugghhhh.... Try to avoid that.
Chris M. Thomasson
2024-03-30 02:46:29 UTC
Permalink
Post by Chris M. Thomasson
Post by Chris M. Thomasson
Post by Chris M. Thomasson
Post by Bonita Montero
The following program simulates constant locking und unlocking of one
to jthread::hardware_concurrency() threadas with a std::mutex and a
futex. On my 16 core / 32 thread Zen4 system a futex is faster up to
5 threads constantly contending, but beyond the CPU time of the futex
explodes and the conventional mutex is faster with Windows as and with
Linux.
[...]
A futex is not a mutex!
I have to take a look at the logic you used for your "mutex" based on
a futex, not enough time right now. Perhaps, the std::mutex is just
way more efficient that your use of a futex? Keep in mind futexs are
https://cis.temple.edu/~giorgio/cis307/readings/futex.pdf
;^)
It depends on the logic you use.
You generally want a "waitbit" type of logic... I posted a mutex based
on a futex here a while back. I just need to find it. Iirc, it used a
Windows "futex"... If I can find it, you should test against it. It used
NO compare and swap. Ugghhhh.... Try to avoid that.
Another key, try to avoid waiting on a futex, strive for that. Maximize
the fast paths, and minimize the slow paths.
Bonita Montero
2024-03-30 08:25:26 UTC
Permalink
Post by Chris M. Thomasson
Post by Chris M. Thomasson
Post by Chris M. Thomasson
Post by Bonita Montero
The following program simulates constant locking und unlocking of one
to jthread::hardware_concurrency() threadas with a std::mutex and a
futex. On my 16 core / 32 thread Zen4 system a futex is faster up to
5 threads constantly contending, but beyond the CPU time of the futex
explodes and the conventional mutex is faster with Windows as and with
Linux.
[...]
A futex is not a mutex!
I have to take a look at the logic you used for your "mutex" based on
a futex, not enough time right now. Perhaps, the std::mutex is just
way more efficient that your use of a futex? Keep in mind futexs are
https://cis.temple.edu/~giorgio/cis307/readings/futex.pdf
;^)
It depends on the logic you use.
You generally want a "waitbit" type of logic...
The sign-bit of my uint64 shows that signs that the lock is occupied
and the 63 lower bits sign the number of threads waiting to get owner-
ship. But i simplified the code a lot by using just an atomic_bool.
Now under Windows the futex is superior from 12 threads on. But under
Linux the futex is slower except for one or two threads.

Windows:

1: 248%
2: 54%
3: 27%
4: 28%
5: 33%
6: 31%
7: 37%
8: 37%
9: 53%
10: 42%
11: 82%
12: 101%
13: 116%
14: 142%
15: 153%
16: 160%
17: 175%
18: 191%
19: 192%
20: 194%
21: 206%
22: 211%
23: 221%
24: 220%
25: 223%
26: 220%
27: 220%
28: 223%
29: 222%
30: 220%
31: 219%
32: 216%

Linux:

1: 121%
2: 106%
3: 67%
4: 99%
5: 87%
6: 18%
7: 13%
8: 12%
9: 10%
10: 10%
11: 9%
12: 9%
13: 8%
14: 8%
15: 8%
16: 8%
17: 8%
18: 8%
19: 8%
20: 8%
21: 7%
22: 7%
23: 8%
24: 7%
25: 7%
26: 7%
27: 7%
28: 7%
29: 7%
30: 7%
31: 7%
32: 6%

#include <iostream>
#include <thread>
#include <atomic>
#include <chrono>
#include <vector>
#include <mutex>

using namespace std;
using namespace chrono;

int main()
{
unsigned hc = jthread::hardware_concurrency();
auto test = [&]( char const *head, auto fn, vector<double> &times )
{
cout << head << endl;
constexpr size_t ROUNDS = 10'000;
atomic_int64_t tSum, roundsSum;
vector<jthread> threads;
for( unsigned nThreads = 1; nThreads <= hc; ++nThreads )
{
auto st = []<typename T, integral X>( atomic<T> &a, X v ) { return
a.store( (T)v, memory_order_relaxed ); };
st( tSum, 0 );
st( roundsSum, 0 );
atomic_bool stop( false );
auto ld = []<typename T>( atomic<T> &a ) { return a.load(
memory_order_relaxed ); };
for( unsigned t = nThreads; t; --t )
threads.emplace_back( [&]
{
auto start = high_resolution_clock::now();
int64_t rounds = 0;
for( ; !ld( stop ); fn(), ++rounds );
auto add = []( atomic_int64_t &a, int64_t add ) { a.fetch_add(
add, memory_order_relaxed ); };
add( tSum, duration_cast<nanoseconds>(
high_resolution_clock::now() - start ).count() );
add( roundsSum, rounds );
} );
this_thread::sleep_for( 100ms );
st( stop, true );
threads.resize( 0 );
times.emplace_back( ld( tSum ) / (double)ld( roundsSum ) );
cout << "\t" << nThreads << ": " << times.back() << endl;
}
};
vector<double> mutexTimes, futexTimes;
{
mutex mtx;
test( "mutex:",
[&]
{
mtx.lock();
mtx.unlock();
}, mutexTimes );
}
{
atomic_bool futex( false );
test( "futex:",
[&]
{
for( bool cmp; !futex.compare_exchange_weak( cmp = false, true,
memory_order_acquire, memory_order_relaxed ); )
futex.wait( true, memory_order_relaxed );
futex.store( false, memory_order_release );
futex.notify_one();
}, futexTimes );
}
cout << "relation: " << endl;
for( size_t i = 0; i != hc; ++i )
cout << "\t" << i + 1 << ": " << (int)(100.0 * mutexTimes[i] /
futexTimes[i] + 0.5) << "%" << endl;
}
Chris M. Thomasson
2024-03-30 18:51:59 UTC
Permalink
{
                for( bool cmp; !futex.compare_exchange_weak( cmp =
false, true, memory_order_acquire, memory_order_relaxed ); )
                    futex.wait( true, memory_order_relaxed );
                futex.store( false, memory_order_release );
                futex.notify_one();
            }, futexTimes );
You need to introduce some sort of waitbit logic to minimize calls to
futex.notify_one() and futex.wait(). I posted an example but am having a
hard time finding the damn thing in comp.lang.c++. I will try to find it.
Chris M. Thomasson
2024-03-30 19:02:28 UTC
Permalink
Post by Chris M. Thomasson
{
                 for( bool cmp; !futex.compare_exchange_weak( cmp =
false, true, memory_order_acquire, memory_order_relaxed ); )
                     futex.wait( true, memory_order_relaxed );
                 futex.store( false, memory_order_release );
                 futex.notify_one();
             }, futexTimes );
You need to introduce some sort of waitbit logic to minimize calls to
futex.notify_one() and futex.wait(). I posted an example but am having a
hard time finding the damn thing in comp.lang.c++. I will try to find it.
I found my older experimental code, take a deep look at ct_futex_mutex:

https://groups.google.com/g/comp.lang.c++/c/1MZvhswJ6DQ/m/qyaYH-i0CgAJ
Bonita Montero
2024-03-30 20:18:06 UTC
Permalink
Post by Chris M. Thomasson
You need to introduce some sort of waitbit logic to minimize calls to
futex.notify_one() and futex.wait(). I posted an example but am having a
hard time finding the damn thing in comp.lang.c++. I will try to find it.
Futex notify and wait are at least very cheap with Windows.
Chris M. Thomasson
2024-03-31 20:11:52 UTC
Permalink
Post by Bonita Montero
Post by Chris M. Thomasson
You need to introduce some sort of waitbit logic to minimize calls to
futex.notify_one() and futex.wait(). I posted an example but am having
a hard time finding the damn thing in comp.lang.c++. I will try to
find it.
Futex notify and wait are at least very cheap with Windows.
Well, I have not taken a look at Microsoft's Futex implementation.
Afaict, your code showcases that aspect wrt calling into notify on every
mutex unlock, and wait on every point of contention wrt lock. Interesting.
Bonita Montero
2024-04-01 06:32:08 UTC
Permalink
Post by Chris M. Thomasson
Well, I have not taken a look at Microsoft's Futex implementation.
Afaict, your code showcases that aspect wrt calling into notify on every
mutex unlock, and wait on every point of contention wrt lock. Interesting.
I'm using notify and wait only when there's contention.
Chris M. Thomasson
2024-04-01 06:40:28 UTC
Permalink
Post by Bonita Montero
Post by Chris M. Thomasson
Well, I have not taken a look at Microsoft's Futex implementation.
Afaict, your code showcases that aspect wrt calling into notify on
every mutex unlock, and wait on every point of contention wrt lock.
Interesting.
I'm using notify and wait only when there's contention.
Even here?
_____________
for( bool cmp; !futex.compare_exchange_weak( cmp =
false, true, memory_order_acquire, memory_order_relaxed ); )
futex.wait( true, memory_order_relaxed );
futex.store( false, memory_order_release );
futex.notify_one();
}, futexTimes );
____________

What am I missing?

futex.store( false, memory_order_release );
futex.notify_one();

?
Bonita Montero
2024-04-01 08:57:32 UTC
Permalink
Post by Chris M. Thomasson
Post by Bonita Montero
Post by Chris M. Thomasson
Well, I have not taken a look at Microsoft's Futex implementation.
Afaict, your code showcases that aspect wrt calling into notify on
every mutex unlock, and wait on every point of contention wrt lock.
Interesting.
I'm using notify and wait only when there's contention.
Even here?
_____________
                 for( bool cmp; !futex.compare_exchange_weak( cmp =
false, true, memory_order_acquire, memory_order_relaxed ); )
                     futex.wait( true, memory_order_relaxed );
                 futex.store( false, memory_order_release );
                 futex.notify_one();
             }, futexTimes );
____________
What am I missing?
                 futex.store( false, memory_order_release );
                 futex.notify_one();
?
Ok, I was talking about my older version.
But notify is fast.
Bonita Montero
2024-04-01 09:01:05 UTC
Permalink
Post by Bonita Montero
Post by Chris M. Thomasson
Post by Bonita Montero
Post by Chris M. Thomasson
Well, I have not taken a look at Microsoft's Futex implementation.
Afaict, your code showcases that aspect wrt calling into notify on
every mutex unlock, and wait on every point of contention wrt lock.
Interesting.
I'm using notify and wait only when there's contention.
Even here?
_____________
                  for( bool cmp; !futex.compare_exchange_weak( cmp =
false, true, memory_order_acquire, memory_order_relaxed ); )
                      futex.wait( true, memory_order_relaxed );
                  futex.store( false, memory_order_release );
                  futex.notify_one();
              }, futexTimes );
____________
What am I missing?
                  futex.store( false, memory_order_release );
                  futex.notify_one();
?
Ok, I was talking about my older version.
But notify is fast.
An uncontended notify is 2.5ns on my Zen4 computer.
Chris M. Thomasson
2024-04-01 19:27:31 UTC
Permalink
Post by Bonita Montero
Post by Bonita Montero
Post by Chris M. Thomasson
Post by Bonita Montero
Post by Chris M. Thomasson
Well, I have not taken a look at Microsoft's Futex implementation.
Afaict, your code showcases that aspect wrt calling into notify on
every mutex unlock, and wait on every point of contention wrt lock.
Interesting.
I'm using notify and wait only when there's contention.
Even here?
_____________
                  for( bool cmp; !futex.compare_exchange_weak( cmp =
false, true, memory_order_acquire, memory_order_relaxed ); )
                      futex.wait( true, memory_order_relaxed );
                  futex.store( false, memory_order_release );
                  futex.notify_one();
              }, futexTimes );
____________
What am I missing?
                  futex.store( false, memory_order_release );
                  futex.notify_one();
?
Ok, I was talking about my older version.
But notify is fast.
An uncontended notify is 2.5ns on my Zen4 computer.
For the Microsoft version, right? I don't have Linux installed at the
moment.
Bonita Montero
2024-04-02 02:37:47 UTC
Permalink
Post by Chris M. Thomasson
Post by Bonita Montero
Post by Bonita Montero
Post by Chris M. Thomasson
Post by Bonita Montero
Post by Chris M. Thomasson
Well, I have not taken a look at Microsoft's Futex implementation.
Afaict, your code showcases that aspect wrt calling into notify on
every mutex unlock, and wait on every point of contention wrt
lock. Interesting.
I'm using notify and wait only when there's contention.
Even here?
_____________
                  for( bool cmp; !futex.compare_exchange_weak( cmp =
false, true, memory_order_acquire, memory_order_relaxed ); )
                      futex.wait( true, memory_order_relaxed );
                  futex.store( false, memory_order_release );
                  futex.notify_one();
              }, futexTimes );
____________
What am I missing?
                  futex.store( false, memory_order_release );
                  futex.notify_one();
?
Ok, I was talking about my older version.
But notify is fast.
An uncontended notify is 2.5ns on my Zen4 computer.
For the Microsoft version, right? I don't have Linux installed at the
moment.
Linux with libstdc++ is 1.7ns.
Chris M. Thomasson
2024-04-12 21:16:52 UTC
Permalink
On 4/1/2024 7:37 PM, Bonita Montero wrote:
[...]
Post by Bonita Montero
Post by Chris M. Thomasson
Post by Bonita Montero
Post by Bonita Montero
Ok, I was talking about my older version.
But notify is fast.
An uncontended notify is 2.5ns on my Zen4 computer.
Keep in mind that notify and wait are slow paths. One needs to strive to
minimize calls to notify and wait. Calling notify and/or wait when they
do _have_ to be called is a rather inefficient design.
Post by Bonita Montero
Post by Chris M. Thomasson
For the Microsoft version, right? I don't have Linux installed at the
moment.
Linux with libstdc++ is 1.7ns.
Bonita Montero
2024-04-13 01:00:21 UTC
Permalink
Post by Chris M. Thomasson
[...]
Post by Bonita Montero
Post by Chris M. Thomasson
Post by Bonita Montero
Post by Bonita Montero
Ok, I was talking about my older version.
But notify is fast.
An uncontended notify is 2.5ns on my Zen4 computer.
Keep in mind that notify and wait are slow paths.
There's no slow path with futexes since you've to notify _always_.
And 1.7ns isn't slow.
Post by Chris M. Thomasson
One needs to strive to minimize calls to notify and wait. Calling notify and/or wait when they
do _have_ to be called is a rather inefficient design.
Post by Bonita Montero
Post by Chris M. Thomasson
For the Microsoft version, right? I don't have Linux installed at the
moment.
Linux with libstdc++ is 1.7ns.
Chris M. Thomasson
2024-04-13 19:26:55 UTC
Permalink
Post by Bonita Montero
Post by Chris M. Thomasson
[...]
Post by Bonita Montero
Post by Chris M. Thomasson
Post by Bonita Montero
Post by Bonita Montero
Ok, I was talking about my older version.
But notify is fast.
An uncontended notify is 2.5ns on my Zen4 computer.
Keep in mind that notify and wait are slow paths.
There's no slow path with futexes since you've to notify _always_.
Huh? Nope! Where did you get that idea from? calling notify is a slow
path, and calling wait is a slow path.
Post by Bonita Montero
And 1.7ns isn't slow.
You do not want to _always_ call notify and/or wait when you do not have to.
Post by Bonita Montero
Post by Chris M. Thomasson
One needs to strive to  minimize calls to notify and wait. Calling
notify and/or wait when they do _have_ to be called is a rather
inefficient design.
Post by Bonita Montero
Post by Chris M. Thomasson
For the Microsoft version, right? I don't have Linux installed at
the moment.
Linux with libstdc++ is 1.7ns.
Bonita Montero
2024-04-14 04:19:16 UTC
Permalink
Post by Chris M. Thomasson
Post by Bonita Montero
There's no slow path with futexes since you've to notify _always_.
Huh? Nope! Where did you get that idea from? calling notify is a slow
path, and calling wait is a slow path.
Unlocking a futex always involves a notify since the notify doesn't
depend on a state like a wait().
Post by Chris M. Thomasson
Post by Bonita Montero
And 1.7ns isn't slow.
You do not want to _always_ call notify and/or wait when you do not have to.
You don't unterstand futexes. Show me your mutex basing on C++20 futexes.
Chris M. Thomasson
2024-04-14 04:26:12 UTC
Permalink
Post by Bonita Montero
Post by Chris M. Thomasson
Post by Bonita Montero
There's no slow path with futexes since you've to notify _always_.
Huh? Nope! Where did you get that idea from? calling notify is a slow
path, and calling wait is a slow path.
Unlocking a futex always involves a notify since the notify doesn't
depend on a state like a wait().
Post by Chris M. Thomasson
Post by Bonita Montero
And 1.7ns isn't slow.
You do not want to _always_ call notify and/or wait when you do not have to.
You don't unterstand futexes. Show me your mutex basing on C++20 futexes.
Did you read it? using windwos futexes:

https://groups.google.com/g/comp.lang.c++/c/1MZvhswJ6DQ/m/qyaYH-i0CgAJ

Notice how it does not unlock all the time?

I can port it to c++20.
Bonita Montero
2024-04-14 04:30:52 UTC
Permalink
Post by Chris M. Thomasson
Post by Bonita Montero
You don't unterstand futexes. Show me your mutex basing on C++20 futexes.
https://groups.google.com/g/comp.lang.c++/c/1MZvhswJ6DQ/m/qyaYH-i0CgAJ
Notice how it does not unlock all the time?
I can port it to c++20.
Show me your mutex with a futex.
It's impossible to write that without notifying *always*.
Bonita Montero
2024-04-14 04:34:09 UTC
Permalink
Post by Bonita Montero
Post by Chris M. Thomasson
Post by Bonita Montero
You don't unterstand futexes. Show me your mutex basing on C++20 futexes.
https://groups.google.com/g/comp.lang.c++/c/1MZvhswJ6DQ/m/qyaYH-i0CgAJ
Notice how it does not unlock all the time?
I can port it to c++20.
Show me your mutex with a futex.
It's impossible to write that without notifying *always*.
That's the simplest code in C++20:

struct fute_xchg_mutex
{
fute_xchg_mutex();
void lock();
void unlock();
private:
atomic_bool m_locked;
};

fute_xchg_mutex::fute_xchg_mutex() :
m_locked( false )
{
}

void fute_xchg_mutex::lock()
{
while( m_locked.exchange( true, memory_order_acquire ) )
m_locked.wait( true, memory_order_relaxed );
}

void fute_xchg_mutex::unlock()
{
m_locked.store( false, memory_order_release );
m_locked.notify_one();
}
Chris M. Thomasson
2024-04-14 05:21:42 UTC
Permalink
Post by Bonita Montero
Post by Bonita Montero
Post by Chris M. Thomasson
Post by Bonita Montero
You don't unterstand futexes. Show me your mutex basing on C++20 futexes.
https://groups.google.com/g/comp.lang.c++/c/1MZvhswJ6DQ/m/qyaYH-i0CgAJ
Notice how it does not unlock all the time?
I can port it to c++20.
Show me your mutex with a futex.
It's impossible to write that without notifying *always*.
struct fute_xchg_mutex
{
    fute_xchg_mutex();
    void lock();
    void unlock();
    atomic_bool m_locked;
};
    m_locked( false )
{
}
void fute_xchg_mutex::lock()
{
    while( m_locked.exchange( true, memory_order_acquire ) )
        m_locked.wait( true, memory_order_relaxed );
}
void fute_xchg_mutex::unlock()
{
    m_locked.store( false, memory_order_release );
    m_locked.notify_one();
^^^^^^^^^^

NONONNOONONO!
Post by Bonita Montero
}
Chris M. Thomasson
2024-04-14 05:21:22 UTC
Permalink
Post by Bonita Montero
Post by Chris M. Thomasson
Post by Bonita Montero
You don't unterstand futexes. Show me your mutex basing on C++20 futexes.
https://groups.google.com/g/comp.lang.c++/c/1MZvhswJ6DQ/m/qyaYH-i0CgAJ
Notice how it does not unlock all the time?
I can port it to c++20.
Show me your mutex with a futex.
It's impossible to write that without notifying *always*.
I beg to disagree.
Chris M. Thomasson
2024-04-14 05:22:29 UTC
Permalink
Post by Bonita Montero
Post by Chris M. Thomasson
Post by Bonita Montero
You don't unterstand futexes. Show me your mutex basing on C++20 futexes.
https://groups.google.com/g/comp.lang.c++/c/1MZvhswJ6DQ/m/qyaYH-i0CgAJ
Notice how it does not unlock all the time?
I can port it to c++20.
Show me your mutex with a futex.
It's impossible to write that without notifying *always*.
this is a complete troll, right?
Bonita Montero
2024-04-14 05:30:59 UTC
Permalink
Post by Chris M. Thomasson
Post by Bonita Montero
Post by Chris M. Thomasson
Post by Bonita Montero
You don't unterstand futexes. Show me your mutex basing on C++20 futexes.
https://groups.google.com/g/comp.lang.c++/c/1MZvhswJ6DQ/m/qyaYH-i0CgAJ
Notice how it does not unlock all the time?
I can port it to c++20.
Show me your mutex with a futex.
It's impossible to write that without notifying *always*.
this is a complete troll, right?
You don't know futexes.
Show me your code with a mutex basing on a futex.
Chris M. Thomasson
2024-04-14 05:32:27 UTC
Permalink
Post by Bonita Montero
Post by Chris M. Thomasson
Post by Bonita Montero
Post by Chris M. Thomasson
Post by Bonita Montero
You don't unterstand futexes. Show me your mutex basing on C++20 futexes.
https://groups.google.com/g/comp.lang.c++/c/1MZvhswJ6DQ/m/qyaYH-i0CgAJ
Notice how it does not unlock all the time?
I can port it to c++20.
Show me your mutex with a futex.
It's impossible to write that without notifying *always*.
this is a complete troll, right?
You don't know futexes.
Show me your code with a mutex basing on a futex.
I did. Read again, well Windows Futex:

https://groups.google.com/g/comp.lang.c++/c/1MZvhswJ6DQ/m/qyaYH-i0CgAJ

Do you really want me to port this to c++20?
Bonita Montero
2024-04-14 05:36:25 UTC
Permalink
Post by Chris M. Thomasson
Post by Bonita Montero
Post by Chris M. Thomasson
Post by Bonita Montero
Post by Chris M. Thomasson
Post by Bonita Montero
You don't unterstand futexes. Show me your mutex basing on C++20 futexes.
https://groups.google.com/g/comp.lang.c++/c/1MZvhswJ6DQ/m/qyaYH-i0CgAJ
Notice how it does not unlock all the time?
I can port it to c++20.
Show me your mutex with a futex.
It's impossible to write that without notifying *always*.
this is a complete troll, right?
You don't know futexes.
Show me your code with a mutex basing on a futex.
https://groups.google.com/g/comp.lang.c++/c/1MZvhswJ6DQ/m/qyaYH-i0CgAJ
Do you really want me to port this to c++20?
Show it here with correct formatting.
Chris M. Thomasson
2024-04-14 05:42:57 UTC
Permalink
[...]
Post by Bonita Montero
Post by Chris M. Thomasson
https://groups.google.com/g/comp.lang.c++/c/1MZvhswJ6DQ/m/qyaYH-i0CgAJ
Do you really want me to port this to c++20?
Show it here with correct formatting.
Do you mean, indentation? ;^)

Anyway, you commented in the thread I linked you to.

So, you can read my code, right?
Bonita Montero
2024-04-14 05:46:08 UTC
Permalink
Post by Chris M. Thomasson
[...]
Post by Bonita Montero
Post by Chris M. Thomasson
https://groups.google.com/g/comp.lang.c++/c/1MZvhswJ6DQ/m/qyaYH-i0CgAJ
Do you really want me to port this to c++20?
Show it here with correct formatting.
Do you mean, indentation? ;^)
Anyway, you commented in the thread I linked you to.
So, you can read my code, right?
You don't understand futexes since you call notifying with
the slow path slow. But notifying is even not slow if there
are contenders with futexes.
Chris M. Thomasson
2024-04-14 05:48:45 UTC
Permalink
Post by Bonita Montero
Post by Chris M. Thomasson
[...]
Post by Bonita Montero
Post by Chris M. Thomasson
https://groups.google.com/g/comp.lang.c++/c/1MZvhswJ6DQ/m/qyaYH-i0CgAJ
Do you really want me to port this to c++20?
Show it here with correct formatting.
Do you mean, indentation? ;^)
Anyway, you commented in the thread I linked you to.
So, you can read my code, right?
You don't understand futexes since you call notifying with
the slow path slow. But notifying is even not slow if there
are contenders with futexes.
What? I am quite sure you read this right?
____________________
void unlock()
{
if (InterlockedExchange(&m_state, 0) == 2)
{
WakeByAddressSingle(&m_state);
}
}
____________________

Take careful notice how I try to avoid a call into WakeByAddressSingle?
See? ;^o
Bonita Montero
2024-04-14 09:25:33 UTC
Permalink
Post by Chris M. Thomasson
Post by Bonita Montero
Post by Chris M. Thomasson
[...]
Post by Bonita Montero
Post by Chris M. Thomasson
https://groups.google.com/g/comp.lang.c++/c/1MZvhswJ6DQ/m/qyaYH-i0CgAJ
Do you really want me to port this to c++20?
Show it here with correct formatting.
Do you mean, indentation? ;^)
Anyway, you commented in the thread I linked you to.
So, you can read my code, right?
You don't understand futexes since you call notifying with
the slow path slow. But notifying is even not slow if there
are contenders with futexes.
What? I am quite sure you read this right?
____________________
void unlock()
{
    if (InterlockedExchange(&m_state, 0) == 2)
    {
        WakeByAddressSingle(&m_state);
    }
}
____________________
Take careful notice how I try to avoid a call into WakeByAddressSingle?
See? ;^o
This is from a benchmark of two constantly contending threads:

atomic with locker counter + std::binary_semaphore:
10.4597
atoomic with locker counter + shown futephore:
14.9949
Chris Code:
18.6445
shown atomic_bool with always wakeup:
24.7084
std::mutex:
33.8646

The numbers are in nanoseconds for a locking unlocking iteration.
Bonita Montero
2024-04-14 13:12:17 UTC
Permalink
That's my code:

#include <iostream>
#include <mutex>
#include <thread>
#include <chrono>
#include <semaphore>
#include <algorithm>
#include <vector>

using namespace std;
using namespace chrono;

int main()
{
using kv = pair<char const *, double>;
vector<kv> results;
auto bench = [&]( char const *what, auto fn )
{
constexpr size_t ROUNDS = 10'000'000;
atomic_int64_t tSum = 0;
auto thr = [&]()
{
auto start = high_resolution_clock::now();
for( size_t r = ROUNDS; r; --r )
fn();
tSum.fetch_add( duration_cast<nanoseconds>(
high_resolution_clock::now() - start ).count() );
};
jthread jtA( thr ), jtB( thr );
jtA.join(), jtB.join();
results.emplace_back( what, tSum / ((double)ROUNDS * 2.0) );
};
// atomic coutner with futex-semaphore
atomic_uint32_t
lockCounter( 0 ),
futSem( 0 );
bench( "atomic coutner with futex-semaphore", [&]
{
if( lockCounter.fetch_add( 1, memory_order_acquire ) > 0 )
for( uint32_t before = futSem.load(
memory_order_relaxed ); ; )
if( !before )
futSem.wait( 0, memory_order_relaxed ),
before = futSem.load( memory_order_relaxed );
else if( futSem.compare_exchange_strong( before,
before - 1, memory_order_acquire, memory_order_relaxed ) ) [[likely]]
break;
if( lockCounter.fetch_sub( 1, memory_order_relaxed ) > 1 )
for( uint32_t before = futSem.load(
memory_order_relaxed ); ; )
if( futSem.compare_exchange_strong( before, before
+ 1, memory_order_release, memory_order_relaxed ) )
{
futSem.notify_one();
break;
}
} );
// std::mutex
mutex mtx;
bench( "std::mutex", [&] { mtx.lock(), mtx.unlock(); });
// atomic coutner with a binary_semaphore
binary_semaphore sem( 0 );
bench( "atomic coutner with binary_semaphore", [&]
{
if( lockCounter.fetch_add( 1, memory_order_acquire ) > 0 )
sem.acquire();
if( lockCounter.fetch_sub( 1, memory_order_release ) > 1 )
sem.release();
} );
// xchg-mutex and always wakeup
atomic_bool lockFlag( false );
bench( "xchg-mutex and always wakeup", [&]
{
while( lockFlag.exchange( true, memory_order_acquire ) )
lockFlag.wait( true, memory_order_relaxed );
lockFlag.store( false, memory_order_release );
lockFlag.notify_one();
} );
// Chris' mutex
atomic_int lockState( 0 );
bench( "Chris' mutex", [&]
{
if( lockState.exchange( 1, memory_order_acquire ) )
while( lockState.exchange( 2, memory_order_acquire ) )
lockState.wait( 2, memory_order_acquire );
if( lockState.exchange( 0, memory_order_release ) == 2 )
lockState.notify_one();
} );
sort( results.begin(), results.end(), []( kv &lhs, kv &rhs ) {
return lhs.second < rhs.second; } );
for( kv &result : results )
cout << result.first << ": " << result.second << endl;
}
Chris M. Thomasson
2024-04-14 19:02:18 UTC
Permalink
[...]

Might as well throw my semaphore into the mix:

https://vorbrodt.blog/2019/02/05/fast-semaphore/

I will code up a little program, most likely today, to check out those
neat C++20 futexes.
Chris M. Thomasson
2024-04-14 19:06:07 UTC
Permalink
Post by Chris M. Thomasson
[...]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Error! It's not my semaphore!
Post by Chris M. Thomasson
https://vorbrodt.blog/2019/02/05/fast-semaphore/
I will code up a little program, most likely today, to check out those
neat C++20 futexes.
OOPS! I misstated, I did not create this algo. Afaict, the original can
be found here:

https://www.haiku-os.org/legacy-docs/benewsletter/Issue1-26.html

Sorry everybody. Now, here is _my_ rwlock:

https://vorbrodt.blog/2019/02/14/read-write-mutex

All me! :^)
Bonita Montero
2024-04-14 20:57:30 UTC
Permalink
Post by Chris M. Thomasson
[...]
https://vorbrodt.blog/2019/02/05/fast-semaphore/
I will code up a little program, most likely today, to check out those
neat C++20 futexes.
A semaphore with a mutex, you're funny ...
That ain't fast.
Chris M. Thomasson
2024-04-15 03:23:28 UTC
Permalink
Post by Bonita Montero
Post by Chris M. Thomasson
[...]
https://vorbrodt.blog/2019/02/05/fast-semaphore/
I will code up a little program, most likely today, to check out those
neat C++20 futexes.
A semaphore with a mutex, you're funny ...
That ain't fast.
The fast_semaphore::semaphore is meant to be a kernel semaphore. So, a
Windows one will work. A Linux one will work, and the portable mutex
condvar semaphore impl emulation will work. This semaphore is fast wrt
its fast paths. Btw, you have heard of the benaphore before, right?

https://www.haiku-os.org/legacy-docs/benewsletter/Issue1-26.html#Engineering1-26

Its been there for a while... ;^) It's something that every threading
nerd should know about. Remember when you called me a nerd several times
in the past? lol. ;^)

Anyway, take a close look at the fast paths vs the slow paths:

https://pastebin.com/raw/Q7p6e7Xc

_________________
class fast_semaphore
{
public:
fast_semaphore(int count) noexcept
: m_count(count), m_semaphore(0) {}

void post()
{
std::atomic_thread_fence(std::memory_order_release);
int count = m_count.fetch_add(1, std::memory_order_relaxed);
if (count < 0)
m_semaphore.post();
}

void wait()
{
int count = m_count.fetch_sub(1, std::memory_order_relaxed);
if (count < 1)
m_semaphore.wait();
std::atomic_thread_fence(std::memory_order_acquire);
}

private:
std::atomic<int> m_count;
semaphore m_semaphore;
};
_________________
Chris M. Thomasson
2024-04-15 03:26:13 UTC
Permalink
Post by Bonita Montero
Post by Chris M. Thomasson
[...]
https://vorbrodt.blog/2019/02/05/fast-semaphore/
I will code up a little program, most likely today, to check out
those neat C++20 futexes.
A semaphore with a mutex, you're funny ...
That ain't fast.
[...]

I think you might have a problem differentiating a fast-path from a
slow-path... Humm...
Bonita Montero
2024-04-15 04:15:55 UTC
Permalink
Post by Chris M. Thomasson
Post by Bonita Montero
Post by Chris M. Thomasson
[...]
https://vorbrodt.blog/2019/02/05/fast-semaphore/
I will code up a little program, most likely today, to check out
those neat C++20 futexes.
A semaphore with a mutex, you're funny ...
That ain't fast.
[...]
I think you might have a problem differentiating a fast-path from a
slow-path... Humm...
Absolutely not, but you consider a futex wake as slow.
Chris M. Thomasson
2024-04-15 05:03:24 UTC
Permalink
Post by Bonita Montero
Post by Chris M. Thomasson
Post by Bonita Montero
Post by Chris M. Thomasson
[...]
https://vorbrodt.blog/2019/02/05/fast-semaphore/
I will code up a little program, most likely today, to check out
those neat C++20 futexes.
A semaphore with a mutex, you're funny ...
That ain't fast.
[...]
I think you might have a problem differentiating a fast-path from a
slow-path... Humm...
Absolutely not, but you consider a futex wake as slow.
Yes. I consider a futex notify and wake as "slow" paths. One should
strive to avoid them. I don't know about the current impl of futex wrt
Linux and Windows. I just treat them as kernel calls that should be on
the slow path. Not on the fast path. The damn thing might even lock an
internal hashed mutex on the address for all I know wrt an unknown impl
of a futex.
Bonita Montero
2024-04-15 06:20:51 UTC
Permalink
Post by Chris M. Thomasson
Post by Bonita Montero
Absolutely not, but you consider a futex wake as slow.
Yes. I consider a futex notify and wake as "slow" paths.
I've mentioned that a futex wake with no contenders is 1.7ns or about
10 clock cycles on my Zen4-CPU; that's nothing worth to think about.
Chris M. Thomasson
2024-04-15 19:08:57 UTC
Permalink
Post by Bonita Montero
Post by Chris M. Thomasson
Post by Bonita Montero
Absolutely not, but you consider a futex wake as slow.
Yes. I consider a futex notify and wake as "slow" paths.
I've mentioned that a futex wake with no contenders is 1.7ns or about
10 clock cycles on my Zen4-CPU; that's nothing worth to think about.
I don't know what that futex notify is actually doing under the covers.
One impl might be faster than another. Whatever. I still consider a
futex notify to be a slow path because it denotes contention. Trying to
avoid a call into it is still rather "prudent", well according to me.
Why call a notify when we do not have to?
Bonita Montero
2024-04-16 04:32:11 UTC
Permalink
Post by Chris M. Thomasson
I don't know what that futex notify is actually doing under the covers.
In 10 clock cycles you can't do much.
Post by Chris M. Thomasson
One impl might be faster than another.
With Windows it's ony 2.7ns and with Linux 1.7ns.
Bonita Montero
2024-04-16 09:31:34 UTC
Permalink
Post by Bonita Montero
Post by Chris M. Thomasson
I don't know what that futex notify is actually doing under the covers.
In 10 clock cycles you can't do much.
Post by Chris M. Thomasson
One impl might be faster than another.
With Windows it's ony 2.7ns and with Linux 1.7ns.
Try it yourself:

#include <iostream>
#include <atomic>
#include <chrono>

using namespace std;
using namespace chrono;

int main()
{
atomic_bool ab;
auto start = high_resolution_clock::now();
for( size_t i = 1'000'000'000; i; --i )
ab.notify_one();
cout << duration_cast<nanoseconds>( high_resolution_clock::now() -
start ).count() / 1.0e9 << endl;
}

With WSL2 on a Zen4-CPU it's only 1.2ns, i.e. seven clock cycles.
Chris M. Thomasson
2024-04-16 21:07:35 UTC
Permalink
Post by Bonita Montero
Post by Bonita Montero
Post by Chris M. Thomasson
I don't know what that futex notify is actually doing under the covers.
In 10 clock cycles you can't do much.
Post by Chris M. Thomasson
One impl might be faster than another.
With Windows it's ony 2.7ns and with Linux 1.7ns.
#include <iostream>
#include <atomic>
#include <chrono>
using namespace std;
using namespace chrono;
int main()
{
    atomic_bool ab;
    auto start = high_resolution_clock::now();
    for( size_t i = 1'000'000'000; i; --i )
        ab.notify_one();
    cout << duration_cast<nanoseconds>( high_resolution_clock::now() -
start ).count() / 1.0e9 << endl;
}
With WSL2 on a Zen4-CPU it's only 1.2ns, i.e. seven clock cycles.
I will, however I am working on another project right now, heavy
graphics based. Basically, I would want to see how notify_one is
actually implemented. Bust out the disassembler.
Bonita Montero
2024-04-17 08:24:20 UTC
Permalink
Post by Chris M. Thomasson
I will, however I am working on another project right now, heavy
graphics based. Basically, I would want to see how notify_one is
actually implemented. Bust out the disassembler.
It will practically not matter how it is implemented if it is that fast.
Chris M. Thomasson
2024-04-17 18:48:42 UTC
Permalink
Post by Bonita Montero
Post by Chris M. Thomasson
I will, however I am working on another project right now, heavy
graphics based. Basically, I would want to see how notify_one is
actually implemented. Bust out the disassembler.
It will practically not matter how it is implemented if it is that fast.
It just might take an internal hashed mutex to check for contention...
Humm... I don't know until I look at it. I don't even want it to do a
CAS, or execute any membars on a "fast-path". Since I don't know what
its doing under the covers, I still think its "prudent" to try to avoid
calling into it when we do absolutely have to.
Chris M. Thomasson
2024-04-17 18:52:13 UTC
Permalink
Post by Chris M. Thomasson
Post by Bonita Montero
Post by Chris M. Thomasson
I will, however I am working on another project right now, heavy
graphics based. Basically, I would want to see how notify_one is
actually implemented. Bust out the disassembler.
It will practically not matter how it is implemented if it is that fast.
It just might take an internal hashed mutex to check for contention...
Humm... I don't know until I look at it. I don't even want it to do a
CAS, or execute any membars on a "fast-path". Since I don't know what
its doing under the covers, I still think its "prudent" to try to avoid
calling into it when we do absolutely have to.
^^^^^^^^^^^^^^

God damn typos! Sorry Bonita!

I still think its "prudent" to try to avoid calling into it, _except_
when we absolutely have to.

Argh! Sorry again.
Chris M. Thomasson
2024-04-16 21:14:25 UTC
Permalink
Post by Bonita Montero
Post by Chris M. Thomasson
I don't know what that futex notify is actually doing under the covers.
In 10 clock cycles you can't do much.
Well, futex notify might have fast paths in and of itself. To be prudent
I would need to see how they implement it to allow a futex notify by,
every time. Fair enough?
Post by Bonita Montero
Post by Chris M. Thomasson
One impl might be faster than another.
With Windows it's ony 2.7ns and with Linux 1.7ns.
Bonita Montero
2024-04-18 13:12:22 UTC
Permalink
Post by Chris M. Thomasson
Well, futex notify might have fast paths in and of itself. To be prudent
I would need to see how they implement it to allow a futex notify by,
every time. Fair enough?
I'm asking myself if it would be possible to have context-switching as
most as possible in userspace. If there would be a context-switch from
one thread of a process to another thread because a timeslice expired
the kernel should send a signal to the thread and the thread does the
userspace context-switch by itself. Only if there's a context switch
to another process' thread or in kernel mode the kernel's scheduler
acts itself.
This would give the opportunity to have voluntary context switches
when doing locking much faster than trough the kernel, and voluntary
context switches usually happen with a much higher frequency that
there would be a real gain.
With Linux this would be possible trough signals and on Windows the
kernel could induce SEH-exceptions for a thread-switch.
Richard Damon
2024-04-19 02:03:12 UTC
Permalink
Post by Bonita Montero
Post by Chris M. Thomasson
Well, futex notify might have fast paths in and of itself. To be
prudent I would need to see how they implement it to allow a futex
notify by, every time. Fair enough?
I'm asking myself if it would be possible to have context-switching as
most as possible in userspace. If there would be a context-switch from
one thread of a process to another thread because a timeslice expired
the kernel should send a signal to the thread and the thread does the
userspace context-switch by itself. Only if there's a context switch
to another process' thread or in kernel mode the kernel's scheduler
acts itself.
This would give the opportunity to have voluntary context switches
when doing locking much faster than trough the kernel, and voluntary
context switches usually happen with a much higher frequency that
there would be a real gain.
With Linux this would be possible trough signals and on Windows the
kernel could induce SEH-exceptions for a thread-switch.
How do you "signal" a user-thread without doing a kernel operation and a
thread switch?

Admittedly, if the kernel knows it is switching from one thread to
another in the same process it can do a lighter weight sort of
context-switch, but it still needs to deal with kernel space operations.
Bonita Montero
2024-04-19 04:18:18 UTC
Permalink
Post by Richard Damon
How do you "signal" a user-thread without doing a kernel operation
and a thread switch?
The signals for the involuntary userspace thread-switch would be sent
by a dedicated kernel-thread or by the timer interrupt. This would be
more expensive than a thread-switch through the timer interrupt but
as voluntary thread-switches have a much higher frequency this would
be outweighed.
Post by Richard Damon
Admittedly, if the kernel knows it is switching from one thread
to another in the same process it can do a lighter weight sort of
context-switch, but it still needs to deal with kernel space operations.
A context switch through the kernel is always expensive. A user
-level thread switch when blocking for a lock would be much faster.
Scott Lurndal
2024-04-19 13:41:36 UTC
Permalink
Post by Bonita Montero
Post by Richard Damon
Admittedly, if the kernel knows it is switching from one thread
to another in the same process it can do a lighter weight sort of
context-switch, but it still needs to deal with kernel space operations.
A context switch through the kernel is always expensive. A user
-level thread switch when blocking for a lock would be much faster.
SVR4.2MP implemented a M-N thread model (M user threads mapped to
N kernel threads). Turned out not to work well.
Bonita Montero
2024-04-19 13:44:37 UTC
Permalink
Post by Scott Lurndal
Post by Bonita Montero
Post by Richard Damon
Admittedly, if the kernel knows it is switching from one thread
to another in the same process it can do a lighter weight sort of
context-switch, but it still needs to deal with kernel space operations.
A context switch through the kernel is always expensive. A user
-level thread switch when blocking for a lock would be much faster.
SVR4.2MP implemented a M-N thread model (M user threads mapped to
N kernel threads). Turned out not to work well.
The thing that I'm imaging is still 1:1 but if threads are in userspace
thread-switching would be done by the userspace.
Scott Lurndal
2024-04-19 14:48:44 UTC
Permalink
Post by Bonita Montero
Post by Scott Lurndal
Post by Bonita Montero
Post by Richard Damon
Admittedly, if the kernel knows it is switching from one thread
to another in the same process it can do a lighter weight sort of
context-switch, but it still needs to deal with kernel space operations.
A context switch through the kernel is always expensive. A user
-level thread switch when blocking for a lock would be much faster.
SVR4.2MP implemented a M-N thread model (M user threads mapped to
N kernel threads). Turned out not to work well.
The thing that I'm imaging is still 1:1 but if threads are in userspace
thread-switching would be done by the userspace.
Feel free to prototype it using setcontext(2), getcontext(2) and
makecontext(2).
Bonita Montero
2024-04-19 15:32:38 UTC
Permalink
Post by Scott Lurndal
Feel free to prototype it using setcontext(2), getcontext(2) and
makecontext(2).
I'd need the support of the kernel which should not make context
switches to another thread inside the same process if the thread
is within userspace. And the kernel should have to periodically
inject signals from the timer interrupt to userspace to make it
possible that the userspace-code does the involuntary context
-switch on its own. And I'd need synchronization-primitives like
mutexes and semaphores that would do the otherwise costly context
-switch in userspace; but that's rather easy compared to the kernel
support.
Scott Lurndal
2024-04-19 18:27:14 UTC
Permalink
Post by Bonita Montero
Post by Scott Lurndal
Feel free to prototype it using setcontext(2), getcontext(2) and
makecontext(2).
I'd need the support of the kernel which should not make context
switches to another thread inside the same process if the thread
is within userspace. And the kernel should have to periodically
inject signals from the timer interrupt to userspace to make it
possible that the userspace-code does the involuntary context
-switch on its own. And I'd need synchronization-primitives like
mutexes and semaphores that would do the otherwise costly context
-switch in userspace; but that's rather easy compared to the kernel
support.
https://www.kernel.org/

Feel free to modify the kernel to your heart's content.
Bonita Montero
2024-04-20 02:48:47 UTC
Permalink
Post by Scott Lurndal
Post by Bonita Montero
Post by Scott Lurndal
Feel free to prototype it using setcontext(2), getcontext(2) and
makecontext(2).
I'd need the support of the kernel which should not make context
switches to another thread inside the same process if the thread
is within userspace. And the kernel should have to periodically
inject signals from the timer interrupt to userspace to make it
possible that the userspace-code does the involuntary context
-switch on its own. And I'd need synchronization-primitives like
mutexes and semaphores that would do the otherwise costly context
-switch in userspace; but that's rather easy compared to the kernel
support.
https://www.kernel.org/
Feel free to modify the kernel to your heart's content.
Seems you don't understand the idead and you think this isn't
possible.
Scott Lurndal
2024-04-20 15:16:49 UTC
Permalink
Post by Bonita Montero
Post by Scott Lurndal
Post by Bonita Montero
Post by Scott Lurndal
Feel free to prototype it using setcontext(2), getcontext(2) and
makecontext(2).
I'd need the support of the kernel which should not make context
switches to another thread inside the same process if the thread
is within userspace. And the kernel should have to periodically
inject signals from the timer interrupt to userspace to make it
possible that the userspace-code does the involuntary context
-switch on its own. And I'd need synchronization-primitives like
mutexes and semaphores that would do the otherwise costly context
-switch in userspace; but that's rather easy compared to the kernel
support.
https://www.kernel.org/
Feel free to modify the kernel to your heart's content.
Seems you don't understand the idead and you think this isn't
possible.
It seems the lack of understanding is on you. Du verstehst es nicht.

"... should not make context switches to another thread inside
the same process if the thread is within userspace".

It seems clear that the thread within userspace is completely
invisible to the kernel, thus it cannot by definition switch to it.

It's not that I don't think it is possible, I just don't think
it provides any measurable benefit for the added complexity.
Bonita Montero
2024-04-20 15:36:35 UTC
Permalink
Post by Scott Lurndal
It seems clear that the thread within userspace is completely
invisible to the kernel, thus it cannot by definition switch to it.
I'm not talking about userlevel-threads but the state when a thread
is in userspace-state. You lack imagination since you don't understand
my idea.

Richard Damon
2024-04-20 03:00:18 UTC
Permalink
Post by Bonita Montero
Post by Richard Damon
How do you "signal" a user-thread without doing a kernel operation
and a thread switch?
The signals for the involuntary userspace thread-switch would be sent
by a dedicated kernel-thread or by the timer interrupt. This would be
more expensive than a thread-switch through the timer interrupt but
as voluntary thread-switches have a much higher frequency this would
be outweighed.
TO WHAT?

Are you going to reserve a core with a dedicated thread to do this?

To "interrupt" a user thread to notify it, you would either need to
perform a context switch to save the threads previous context or make
the interrupt non-returnable. If you are going to context switch to the
notification thread, you might as well switch the the new user-thread
that you want to go to.
Post by Bonita Montero
Post by Richard Damon
Admittedly, if the kernel knows it is switching from one thread
to another in the same process it can do a lighter weight sort of
context-switch, but it still needs to deal with kernel space operations.
A context switch through the kernel is always expensive. A user
-level thread switch when blocking for a lock would be much faster.
But you still had the kernal doing a context switch.
Bonita Montero
2024-04-20 04:11:57 UTC
Permalink
Post by Richard Damon
Post by Bonita Montero
Post by Richard Damon
How do you "signal" a user-thread without doing a kernel operation
and a thread switch?
The signals for the involuntary userspace thread-switch would be sent
by a dedicated kernel-thread or by the timer interrupt. This would be
more expensive than a thread-switch through the timer interrupt but
as voluntary thread-switches have a much higher frequency this would
be outweighed.
TO WHAT?
Are you going to reserve a core with a dedicated thread to do this?
A userspace thread is scheduled on a core if there's no other thread
on that core. But if there would be other threads eligible to run on
the core they would be scheduled by a signal in userspace if their
context is also in userspace.
Post by Richard Damon
To "interrupt" a user thread to notify it, you would either need to
perform a context switch to save the threads previous context or make
the interrupt non-returnable. If you are going to context switch to the
notification thread, you might as well switch the the new user-thread
that you want to go to.
You simply don't understand my idea !
Chris M. Thomasson
2024-04-14 19:10:19 UTC
Permalink
[...]
    atomic_int lockState( 0 );
    bench( "Chris' mutex", [&]
        {
            if( lockState.exchange( 1, memory_order_acquire ) )
                while( lockState.exchange( 2, memory_order_acquire ) )
                    lockState.wait( 2, memory_order_acquire );
No need for memory_order_acquire on the wait. Relaxed will do.
            if( lockState.exchange( 0, memory_order_release ) == 2 )
                lockState.notify_one();
        } );
    sort( results.begin(), results.end(), []( kv &lhs, kv &rhs ) {
return lhs.second < rhs.second; } );
    for( kv &result : results )
        cout << result.first << ": " << result.second << endl;
}
Chris M. Thomasson
2024-04-01 05:01:55 UTC
Permalink
Post by Chris M. Thomasson
{
                 for( bool cmp; !futex.compare_exchange_weak( cmp =
false, true, memory_order_acquire, memory_order_relaxed ); )
                     futex.wait( true, memory_order_relaxed );
                 futex.store( false, memory_order_release );
                 futex.notify_one();
             }, futexTimes );
You need to introduce some sort of waitbit logic to minimize calls to
futex.notify_one() and futex.wait(). I posted an example but am having a
hard time finding the damn thing in comp.lang.c++. I will try to find it.
Also, does notify_one automatically imply release semantics?
Chris M. Thomasson
2024-04-01 05:02:23 UTC
Permalink
Post by Chris M. Thomasson
Post by Chris M. Thomasson
{
                 for( bool cmp; !futex.compare_exchange_weak( cmp =
false, true, memory_order_acquire, memory_order_relaxed ); )
                     futex.wait( true, memory_order_relaxed );
                 futex.store( false, memory_order_release );
                 futex.notify_one();
             }, futexTimes );
You need to introduce some sort of waitbit logic to minimize calls to
futex.notify_one() and futex.wait(). I posted an example but am having
a hard time finding the damn thing in comp.lang.c++. I will try to
find it.
Also, does notify_one automatically imply release semantics?
I don't think so, humm. Cannot remember.
Bonita Montero
2024-03-30 04:18:35 UTC
Permalink
Post by Chris M. Thomasson
Post by Bonita Montero
The following program simulates constant locking und unlocking of one
to jthread::hardware_concurrency() threadas with a std::mutex and a
futex. On my 16 core / 32 thread Zen4 system a futex is faster up to
5 threads constantly contending, but beyond the CPU time of the futex
explodes and the conventional mutex is faster with Windows as and with
Linux.
[...]
A futex is not a mutex!
You can build a mutex on top of a futex and that's the typical usage.
Chris M. Thomasson
2024-03-30 19:00:31 UTC
Permalink
Post by Bonita Montero
The following program simulates constant locking und unlocking of one
to jthread::hardware_concurrency() threadas with a std::mutex and a
futex. On my 16 core / 32 thread Zen4 system a futex is faster up to
5 threads constantly contending, but beyond the CPU time of the futex
explodes and the conventional mutex is faster with Windows as and with
Linux.
[...]

I think I found my highly experimental windows futex based mutex:


https://groups.google.com/g/comp.lang.c++/c/1MZvhswJ6DQ/m/qyaYH-i0CgAJ

(read all to refresh the context... ;^)


Notice:
___________________
void unlock()
{
if (InterlockedExchange(&m_state, 0) == 2)
{
WakeByAddressSingle(&m_state);
}
}
___________________


We only signal when we have to!


copy and paste from the link. Sorry about the copy and paste destroying
all of my indentation, argh!


Can you run this Bonita? Fwiw, it would be my first time using a Windows
Futex! lol:
______________________________________
#include <iostream>
#include <thread>
#include <functional>

#define WIN32_LEAN_AND_MEAN
#include <Windows.h>


#define CT_L2_ALIGNMENT 128
#define CT_THREADS 32
#define CT_ITERS 6666666


struct ct_futex_mutex
{
ULONG alignas(CT_L2_ALIGNMENT) m_state;

ct_futex_mutex() : m_state(0)
{

}

void lock()
{
if (InterlockedExchange(&m_state, 1))
{
while (InterlockedExchange(&m_state, 2))
{
ULONG cmp = 2;
WaitOnAddress(&m_state, &cmp, sizeof(ULONG), INFINITE);
}
}
}

void unlock()
{
if (InterlockedExchange(&m_state, 0) == 2)
{
WakeByAddressSingle(&m_state);
}
}
};


struct ct_shared
{
ct_futex_mutex m_mtx;
unsigned long m_count;

ct_shared() : m_count(0) {}

~ct_shared()
{
if (m_count != 0)
{
std::cout << "counter is totally fubar!\n";
}
}
};


void ct_thread(ct_shared& shared)
{
for (unsigned long i = 0; i < CT_ITERS; ++i)
{
shared.m_mtx.lock();
++shared.m_count;
shared.m_mtx.unlock();

shared.m_mtx.lock();
--shared.m_count;
shared.m_mtx.unlock();
}
}


int main()
{
std::thread threads[CT_THREADS];

std::cout << "Starting up...\n";

{
ct_shared shared;

for (unsigned long i = 0; i < CT_THREADS; ++i)
{
threads[i] = std::thread(ct_thread, std::ref(shared));
}

std::cout << "Running...\n";

for (unsigned long i = 0; i < CT_THREADS; ++i)
{
threads[i].join();
}
}

std::cout << "Completed!\n";

return 0;
}
______________________________________
Michael S
2024-04-01 09:18:00 UTC
Permalink
On Fri, 29 Mar 2024 14:14:11 +0100
Post by Bonita Montero
The following program simulates constant locking und unlocking of one
to jthread::hardware_concurrency() threadas with a std::mutex and a
futex. On my 16 core / 32 thread Zen4 system a futex is faster up to
5 threads constantly contending, but beyond the CPU time of the futex
explodes and the conventional mutex is faster with Windows as and with
Linux.
In case of heavy contention what to consider 'faster' is not at all
obvious.

On lightly loaded system with more cores than work to do (a typical
client) 'faster' means faster forward progress of group of contending
threads. Achieved by very long polling before switching to wait,
probably up to several tens of usec and by hyperactive tickless OS
scheduler.

On heavily loaded system with much more work to do than available
cores, 'faster' means more work done by unrelated threads and processes.
Achieved by very short polling before switching to wait, probably less
than 500 nsec and by 'passive' OS scheduler that rarely intervenes
outside of clock tick.

And of course there are cases in the middle.

And then traditional HPC with MPI that is completely different kettle of
fish.
Chris M. Thomasson
2024-04-05 02:56:13 UTC
Permalink
Post by Michael S
On Fri, 29 Mar 2024 14:14:11 +0100
Post by Bonita Montero
The following program simulates constant locking und unlocking of one
to jthread::hardware_concurrency() threadas with a std::mutex and a
futex. On my 16 core / 32 thread Zen4 system a futex is faster up to
5 threads constantly contending, but beyond the CPU time of the futex
explodes and the conventional mutex is faster with Windows as and with
Linux.
In case of heavy contention what to consider 'faster' is not at all
obvious.
On lightly loaded system with more cores than work to do (a typical
client) 'faster' means faster forward progress of group of contending
threads. Achieved by very long polling before switching to wait,
probably up to several tens of usec and by hyperactive tickless OS
scheduler.
On heavily loaded system with much more work to do than available
cores, 'faster' means more work done by unrelated threads and processes.
Achieved by very short polling before switching to wait, probably less
than 500 nsec and by 'passive' OS scheduler that rarely intervenes
outside of clock tick.
And of course there are cases in the middle.
And then traditional HPC with MPI that is completely different kettle of
fish.
Ditto. However, I did find it interesting (20+ years ago) to test worst
case scenarios. What happens when I blast this mutex algorithm A with
artificial load vs algo B forcing it into horrible periods of radical
contention. How do they handle it...
Loading...