Discussion:
Futexes ain't fast
(too old to reply)
Bonita Montero
2024-08-28 12:09:10 UTC
Permalink
I tested the operating-system specific mutex (CRITICAL_SECTION Or
pthread_mutext_t) against a futex and a std::mutex. I guessed std::mutex
uses th operating system specific mutex internally, but the times varied
so much across Windows and Linux that I gues that std::mutex used at
least a differently parametrized operating system mutex or maybe even
completely own code.
This are the times and each line has a further contender:

os-mutex C++-futex std::mutex
1 4.538 6.746 6.867
2 14.084 37.033 21.505
3 28.533 111.212 48.5523
4 51.3995 233.28 82.1983
5 85.293 353.142 122.056
6 124.706 482.997 169.534
7 178.731 601.233 250.503
8 285.089 826.833 326.098
9 405.877 1023.48 368.619
10 425.566 1200.88 459.109
11 301.405 1384.84 567.276
12 365.924 1467.74 718.556
13 398.08 1674.63 1152.45
14 471.867 1815.81 1274.33
15 524.513 1949.76 1599.02
16 606.325 2086.62 1773.66
17 655.687 2121.09 2034.13
18 724.213 2267.87 2102.01
19 815.462 2365.32 2398.93
20 868.996 2520.45 2447.82
21 945.699 2588.83 2636.89
22 1093.38 2742.53 2858.13
23 1132.9 2873.63 3080.97
24 1234.79 2963.54 3274.75
25 1336.52 2906.26 3483.22
26 1447.84 3028.91 3676.8
27 1538.75 3227.37 3829.54
28 1624.9 3393.49 4023.94
29 1719.17 3516.76 4184.77
30 1822.47 3644.64 4363.56
31 1937.37 3818.5 4582.23
32 2018.14 3948.14 4705.01



#if defined(_WIN32)
#include <Windows.h>
#elif defined(__unix__)
#include <pthread.h>
#endif
#include <iostream>
#include <atomic>
#include <functional>
#include <thread>
#include <vector>
#include <chrono>
#include <optional>
#include <latch>
#include <mutex>

using namespace std;
using namespace chrono;

int main()
{
using test_t = pair<char const *, function<void ()>>;
vector<jthread> threads;
unsigned hc = jthread::hardware_concurrency();
threads.reserve( hc );
vector<test_t> tests;
#if defined(_WIN32)
CRITICAL_SECTION cs;
InitializeCriticalSection( &cs );
tests.emplace_back( "os-mutex", [&]
{
EnterCriticalSection( &cs );
LeaveCriticalSection( &cs );
} );
#elif defined(__unix__)
pthread_mutex_t pm;
pthread_mutex_init( &pm, nullptr );
tests.emplace_back( "os-mutex", [&]
{
pthread_mutex_lock( &pm );
pthread_mutex_unlock( &pm );
} );
#endif
atomic_bool futex( false );
tests.emplace_back( "C++-futex", [&]
{
while( futex.exchange( true, memory_order_acquire ) )
futex.wait( true, memory_order_relaxed );
futex.exchange( false, memory_order_release );
futex.notify_one();
} );
mutex mtx;
tests.emplace_back( "std::mutex", [&]
{
mtx.lock();
mtx.unlock();
} );
constexpr int64_t ROUNDS = 100'000;
ostringstream oss;
auto print = [&]( auto const &param )
{
oss.str( "" );
oss << param;
string str( oss.str() );
size_t len = str.length();
str = string( len <= 11 ? 11 - len : 0, ' ' ) + str;
cout << str;
};
print( "" );
for( test_t const &test : tests )
print( test.first );
cout << endl;
for( unsigned nThreads = 1; nThreads <= hc; ++nThreads )
{
print( nThreads );
for( test_t const &test : tests )
{
atomic_int64_t tSum( 0 );
latch lat( nThreads + 1 );
for( unsigned t = 0; t < nThreads; ++t )
threads.emplace_back( [&]
{
lat.arrive_and_wait();
auto start = high_resolution_clock::now();
for( int64_t r = ROUNDS; r; --r )
test.second();
tSum.fetch_add( duration_cast<nanoseconds>(
high_resolution_clock::now() - start ).count(), memory_order_relaxed );
} );
lat.arrive_and_wait();
threads.resize( 0 );
print( tSum.load( memory_order_relaxed ) / ((double)nThreads * ROUNDS) );
}
cout << endl;
}
}
Bonita Montero
2024-08-28 12:14:11 UTC
Permalink
This are the times for WSL2 / clang++ 18 / Zen4:

os-mutex C++-futex std::mutex
1 4.56686 4.36047 4.3239
2 28.8664 28.3289 32.279
3 31.1153 81.8263 38.0591
4 60.6797 183.65 59.5259
5 67.3972 299.972 68.6268
6 97.8872 427.379 88.9555
7 116.267 633.114 114.086
8 147.561 1092.22 145.959
9 181.437 1585.49 167.332
10 204.377 1985.08 197.827
11 225.404 2592.28 209.141
12 243.496 3070.99 242.013
13 278.532 3602.04 251.041
14 288.268 4069 275.175
15 313.36 4501.02 298.927
16 338.817 4968.24 323.074
17 370.487 5381.25 352.372
18 412.542 5773.2 380.785
19 447.14 5916.54 411.674
20 465.34 6698.89 439.05
21 501.694 7144.91 475.522
22 547.571 7413.91 523.236
23 588.499 8085.89 556.089
24 625.933 8690.31 585.382
25 669.453 9249.96 625.591
26 719.125 9832.12 670.704
27 760.095 10271.5 709.3
28 791.683 10915.1 748.703
29 842.34 11580.1 796.001
30 891.618 12009.6 832.299
31 929.287 12668.1 892.501
32 976.275 9814.77 933.202
Chris M. Thomasson
2024-08-28 18:52:13 UTC
Permalink
Post by Bonita Montero
I tested the operating-system specific mutex (CRITICAL_SECTION Or
pthread_mutext_t) against a futex and a std::mutex. I guessed std::mutex
uses th operating system specific mutex internally, but the times varied
so much across Windows and Linux that I gues that std::mutex used at
least a differently parametrized operating system mutex or maybe even
completely own code.
[...]
Post by Bonita Montero
            while( futex.exchange( true, memory_order_acquire ) )
                futex.wait( true, memory_order_relaxed );
            futex.exchange( false, memory_order_release );
            futex.notify_one();
        } );
[...]

A wait bit would help out here... ;^) Afaict, this is a rather "naive"
use of futexes. In you use case here, the exchange to unlock can be a
simple atomic store with release semantics. Also, try to think about
calling notify_* only when you absolutely need to... :^)
Bonita Montero
2024-08-29 03:51:21 UTC
Permalink
Post by Chris M. Thomasson
Post by Bonita Montero
I tested the operating-system specific mutex (CRITICAL_SECTION Or
pthread_mutext_t) against a futex and a std::mutex. I guessed std::mutex
uses th operating system specific mutex internally, but the times varied
so much across Windows and Linux that I gues that std::mutex used at
least a differently parametrized operating system mutex or maybe even
completely own code.
[...]
Post by Bonita Montero
             while( futex.exchange( true, memory_order_acquire ) )
                 futex.wait( true, memory_order_relaxed );
             futex.exchange( false, memory_order_release );
             futex.notify_one();
         } );
[...]
A wait bit would help out here... ;^) Afaict, this is a rather "naive"
use of futexes. In you use case here, the exchange to unlock can be a
simple atomic store with release semantics. Also, try to think about
calling notify_* only when you absolutely need to... :^)
Show me your code ...
Chris M. Thomasson
2024-08-29 04:17:16 UTC
Permalink
Post by Bonita Montero
Post by Chris M. Thomasson
Post by Bonita Montero
I tested the operating-system specific mutex (CRITICAL_SECTION Or
pthread_mutext_t) against a futex and a std::mutex. I guessed std::mutex
uses th operating system specific mutex internally, but the times varied
so much across Windows and Linux that I gues that std::mutex used at
least a differently parametrized operating system mutex or maybe even
completely own code.
[...]
Post by Bonita Montero
             while( futex.exchange( true, memory_order_acquire ) )
                 futex.wait( true, memory_order_relaxed );
             futex.exchange( false, memory_order_release );
             futex.notify_one();
         } );
[...]
A wait bit would help out here... ;^) Afaict, this is a rather "naive"
use of futexes. In you use case here, the exchange to unlock can be a
simple atomic store with release semantics. Also, try to think about
calling notify_* only when you absolutely need to... :^)
Show me your code ...
I did a while back for a futex mutex, win32 impl iirc... Do I have to
find the older code?
Chris M. Thomasson
2024-08-29 04:18:45 UTC
Permalink
Post by Chris M. Thomasson
Post by Bonita Montero
Post by Chris M. Thomasson
Post by Bonita Montero
I tested the operating-system specific mutex (CRITICAL_SECTION Or
pthread_mutext_t) against a futex and a std::mutex. I guessed std::mutex
uses th operating system specific mutex internally, but the times varied
so much across Windows and Linux that I gues that std::mutex used at
least a differently parametrized operating system mutex or maybe even
completely own code.
[...]
Post by Bonita Montero
             while( futex.exchange( true, memory_order_acquire ) )
                 futex.wait( true, memory_order_relaxed );
             futex.exchange( false, memory_order_release );
             futex.notify_one();
         } );
[...]
A wait bit would help out here... ;^) Afaict, this is a rather
"naive" use of futexes. In you use case here, the exchange to unlock
can be a simple atomic store with release semantics. Also, try to
think about calling notify_* only when you absolutely need to... :^)
Show me your code ...
I did a while back for a futex mutex, win32 impl iirc... Do I have to
find the older code?
Lets see... It's a bit hard to find for some reason:

https://groups.google.com/g/comp.lang.c++/c/1MZvhswJ6DQ/m/qyaYH-i0CgAJ
Bonita Montero
2024-08-29 11:03:03 UTC
Permalink
Post by Chris M. Thomasson
Post by Chris M. Thomasson
Post by Bonita Montero
Post by Chris M. Thomasson
Post by Bonita Montero
I tested the operating-system specific mutex (CRITICAL_SECTION Or
pthread_mutext_t) against a futex and a std::mutex. I guessed std::mutex
uses th operating system specific mutex internally, but the times varied
so much across Windows and Linux that I gues that std::mutex used at
least a differently parametrized operating system mutex or maybe even
completely own code.
[...]
Post by Bonita Montero
             while( futex.exchange( true, memory_order_acquire ) )
                 futex.wait( true, memory_order_relaxed );
             futex.exchange( false, memory_order_release );
             futex.notify_one();
         } );
[...]
A wait bit would help out here... ;^) Afaict, this is a rather
"naive" use of futexes. In you use case here, the exchange to unlock
can be a simple atomic store with release semantics. Also, try to
think about calling notify_* only when you absolutely need to... :^)
Show me your code ...
I did a while back for a futex mutex, win32 impl iirc... Do I have to
find the older code?
https://groups.google.com/g/comp.lang.c++/c/1MZvhswJ6DQ/m/qyaYH-i0CgAJ
This includes your futex:

os-mutex C++-futex CT-Futex std::mutex
1 4.458 5.288 2.932 6.213
2 14.607 42.9285 33.8295 13.057
3 31.3747 119.928 78.9007 29.7953
4 48.9462 236.334 156.169 38.9327
5 78.7644 401.246 208.919 105.873
6 121.046 652.817 312.418 92.4637
7 229.815 834.371 374.232 176.73
8 228.219 934.64 457.591 294.029
9 428.12 1162.23 465.212 470.136
10 437.309 1244.29 526.45 326.618
11 248.07 1379.83 559.527 550.964
12 319.084 1560.5 532.445 551.784
13 370.088 1644.48 558.57 883.909
14 420.677 1719.65 609.394 1101.48
15 505.628 1884.92 648.361 1288.54
16 538.861 2029.97 670.726 1479.45
17 556.539 2056.18 676.595 1602.34
18 658.829 2205.71 731.472 1958.6
19 673.713 2351.46 771.112 2003.13
20 767.112 2430.94 794.487 2151.04
21 827.977 2549.65 801.974 2443.2
22 917.694 2679.09 881.514 2551.47
23 1016.16 2828.65 905.289 2798.37
24 1096.64 2919.04 946.814 2914.23
25 1161.21 3052.88 996.146 3093
26 1256.53 3094.02 1034.94 3220.38
27 1379.79 3323.04 1079.41 3390.07
28 1446.68 3533.71 1106.27 3516.21
29 1514.12 3618.5 1160 3624.3
30 1601.25 3761.2 1218.03 3765.6
31 1673.59 3913.48 1259.65 3947.46
32 1827.81 4134.67 1319.16 4061.73
Bonita Montero
2024-08-29 12:33:34 UTC
Permalink
This are the Linux/WSL2-times:

os-mutex C++-futex CT-Futex std::mutex
1 4.08416 4.44344 2.97675 4.08336
2 24.8193 48.7768 15.543 19.3337
3 36.7829 83.2617 43.8175 32.0627
4 60.6047 205.076 120.957 55.1088
5 74.1346 324.468 192.807 70.2967
6 95.4995 435.881 302.049 87.4688
7 107.856 644.166 413.012 102.494
8 128.303 864.129 588.455 127.11
9 161.581 1547.53 698.342 142.329
10 186.587 1981.13 806.801 159.648
11 215.675 2384.41 978.299 177.471
12 233.432 3056.21 704.016 213.383
13 300.081 3646.18 1386.07 214.441
14 278.64 4031.88 1512.2 231.601
15 305.642 4473.59 1829.87 247.191
16 325.154 4978.44 1884.21 296.62
17 367.131 5322.68 1976.28 296.221
18 388.628 5846.84 2087.99 320.431
19 419.67 6308.08 2239.44 349.347
20 455.752 6778.73 2396.71 370.552
21 491.6 7357.81 2482.16 403.741
22 521.929 7712.72 2610.48 435.72
23 557.436 8173.53 2749.9 470.429
24 596.45 8701.01 2948.39 498.316
25 638.711 9491.07 3112.28 530.359
26 674.889 10044.9 3401.87 565.995
27 729.066 10601.1 3436.53 600.626
28 754.609 11189 3679.52 648.257
29 807.071 11928.3 3836.02 675.696
30 840.163 12611.1 4104.16 708.034
31 872.082 13137.5 4278.05 740.399
32 920.411 8702.5 4313.41 787.79

The C++ mutex is the fastest.
Bonita Montero
2024-08-29 14:30:01 UTC
Permalink
This is by far the fastest code:

#if defined(_WIN32)
HANDLE hSem = CreateSemaphoreA( nullptr, 0, 0x7FFFFFFF, nullptr );
auto acquire = [&] { WaitForSingleObject( hSem, INFINITE ); };
auto release = [&] { ReleaseSemaphore( hSem, 1, nullptr ); };
#elif defined(__unix__)
sem_t sem;
sem_init( &sem, 0, 0 );
auto acquire = [&] { while( sem_wait( &sem ) == EINTR ); };
auto release = [&] { sem_post( &sem ); };
#endif
tests.emplace_back( "CT-Sema", [&]
{
if( ctFutex.exchange( 1, memory_order_acquire ) )
while( ctFutex.exchange( 2, memory_order_acquire ) )
acquire();
if( ctFutex.exchange( 0, memory_order_release ) == 2 )
release();
} );

Win32 / clang-cl 17:
os-mutex C++-futex CT-Futex CT-Sema std::mutex
1 4.65 5.94 3.071 2.98 6.764
2 14.7045 32.675 32.973 17.716 12.1095
3 27.089 74.588 78.6393 21.6507 34.3997
4 46.4147 239.19 113.077 27.355 43.3407
5 75.028 355.392 238.347 41.7598 60.594
6 117.501 499.727 317.775 54.0522 86.3667
7 167.781 635.287 417.199 67.6324 123.395
8 234.817 930.783 468.499 75.8145 162.341
9 322.251 1138.95 470.976 87.6566 321.837
10 238.418 1327.29 489.538 116.035 574.937
11 300.167 1427.98 546.078 108.498 680.846
12 307.098 1633.05 561.575 114.932 817.142
13 393.239 1668.65 591.897 117.275 986.565
14 428.639 1916.79 613.624 123.589 1166.91
15 485.699 2016.33 665.218 125.904 1308.23
16 518.917 2108.67 657.133 131.058 1513.81
17 572.802 2135.33 687.547 133.357 1718.77
18 645.806 2241.03 721.078 132.565 1826.97
19 691.06 2361.12 758.582 135.032 2007.85
20 775.441 2465.59 786.588 134.991 2221.41
21 828.668 2574.05 809.456 132.644 2383.34
22 936.921 2747.9 885.664 142.004 2648.4
23 993.523 2822.49 932.029 147.374 2676.98
24 1032.7 2595.27 820.291 151.084 2786.07
25 1158.93 2779.5 863.279 156.877 3014.62
26 1168.73 3018.08 899.575 164.815 3131.53
27 1251.29 3176.82 967.667 171.316 3315.81
28 1346.12 3303.94 987.722 176.866 3519.6
29 1424.58 3519.44 1085.82 179.769 3654.11
30 1519.82 3645.82 1140.33 187.544 3813.35
31 1606.87 3812.95 1176.99 198.398 3993.18
32 1677.32 3916.34 1228.59 200.068 4141.67

WSL2 / clang++ 18:

os-mutex C++-futex CT-Futex CT-Sema std::mutex
1 4.29617 4.67628 3.00231 2.98276 4.08426
2 25.5185 40.8151 14.6537 34.0173 30.4795
3 38.0687 89.9109 42.4446 103.87 25.8821
4 58.7989 173.328 118.727 101.993 61.0923
5 72.2819 303.638 208.134 97.3057 75.2547
6 88.2833 452.347 309.916 114.845 88.307
7 104.865 603.253 403.1 121.215 103.206
8 129.398 954.331 591.535 147.412 126.477
9 161.5 1466.58 599.749 160.904 169.812
10 187.113 1972.09 776.433 185.845 194.201
11 214.369 2529.61 973.352 198.818 215.37
12 238.561 3145.39 1040.81 216.217 241.529
13 259.071 3704.29 1311.71 229.053 267.139
14 282.885 4198.89 1481.91 240.272 282.778
15 310.511 4641.19 1573.61 250.648 311.776
16 331.789 5006.72 1843.99 265.364 333.325
17 361.256 5472.61 1913.92 276.66 369.507
18 388.476 6052.98 1966.87 295.109 396.636
19 428.007 6550.83 2117.1 306.379 427.527
20 464.908 7096.96 2130.08 319.971 468.546
21 498.972 7644.56 2411.33 343.361 498.722
22 535.433 8246.91 2509.67 349.584 537.437
23 568.419 8615.14 2678.5 366.501 571.065
24 614.441 9244.88 2722.57 382.8 616.703
25 651.626 9871.74 2879.58 405.05 646.256
26 693.28 10508.2 3235.45 413.181 679.392
27 727.538 11087.6 3495.82 434.287 726.731
28 769.384 11827.3 3586.65 444.818 760.604
29 811.609 12374.2 3900.95 466.089 826.348
30 866.692 12968.7 4177.86 484.856 856.896
31 905.811 13480.7 4245.24 499.957 902.684
32 942.654 5994.88 4393.53 516.779 955.453

Interesting, that Linux is much less performant here, maybe it's
because WSL.
Chris M. Thomasson
2024-08-29 19:20:59 UTC
Permalink
Post by Bonita Montero
#if defined(_WIN32)
    HANDLE hSem = CreateSemaphoreA( nullptr, 0, 0x7FFFFFFF, nullptr );
    auto acquire = [&] { WaitForSingleObject( hSem, INFINITE ); };
    auto release = [&] { ReleaseSemaphore( hSem, 1, nullptr ); };
^^^^^^^^^^^^^^^^^^^^^^

Actually, release can be SetEvent because the mutex logic only needs a
binary semaphore, or an event on windows. However, I cannot remember if
an event is "faster" than a semaphore on windows or not. Still do not
know how windows handles it's futexes wrt internal impl. WaitOnAddress
and such.
Post by Bonita Montero
#elif defined(__unix__)
    sem_t sem;
    sem_init( &sem, 0, 0 );
    auto acquire = [&] { while( sem_wait( &sem ) == EINTR ); };
    auto release = [&] { sem_post( &sem ); };
#endif
    tests.emplace_back( "CT-Sema", [&]
        {
            if( ctFutex.exchange( 1, memory_order_acquire ) )
                while( ctFutex.exchange( 2, memory_order_acquire ) )
                    acquire();
            if( ctFutex.exchange( 0, memory_order_release ) == 2 )
                release();
        } );
Bonita Montero
2024-08-29 19:44:35 UTC
Permalink
Post by Chris M. Thomasson
Actually, release can be SetEvent because the mutex logic only needs a
binary semaphore, or an event on windows. However, I cannot remember if
an event is "faster" than a semaphore on windows or not. Still do not
know how windows handles it's futexes wrt internal impl. WaitOnAddress
and such.
A semaphore is not signifiantly but measurable faster.
Chris M. Thomasson
2024-08-30 22:20:42 UTC
Permalink
Post by Bonita Montero
Post by Chris M. Thomasson
Actually, release can be SetEvent because the mutex logic only needs a
binary semaphore, or an event on windows. However, I cannot remember
if an event is "faster" than a semaphore on windows or not. Still do
not know how windows handles it's futexes wrt internal impl.
WaitOnAddress and such.
A semaphore is not signifiantly but measurable faster.
See, this would be measuring the "slow" path. If you algo tends to hit a
slow path most of the time, then perhaps is not being used correctly, or
its a sign to take heed, indeed.
Chris M. Thomasson
2024-09-19 20:50:30 UTC
Permalink
Post by Chris M. Thomasson
Post by Bonita Montero
#if defined(_WIN32)
     HANDLE hSem = CreateSemaphoreA( nullptr, 0, 0x7FFFFFFF, nullptr );
     auto acquire = [&] { WaitForSingleObject( hSem, INFINITE ); };
     auto release = [&] { ReleaseSemaphore( hSem, 1, nullptr ); };
^^^^^^^^^^^^^^^^^^^^^^
Actually, release can be SetEvent because the mutex logic only needs a
binary semaphore, or an event on windows. However, I cannot remember if
an event is "faster" than a semaphore on windows or not. Still do not
know how windows handles it's futexes wrt internal impl. WaitOnAddress
and such.
[...]

Not sure if you are familiar with them, but I do remember a special type
of event on Windows in the kernel. Iirc, they were called keyed events.
Still not sure how Windows implements WaitOnAddress.
Chris M. Thomasson
2024-08-29 20:12:32 UTC
Permalink
Post by Bonita Montero
#if defined(_WIN32)
    HANDLE hSem = CreateSemaphoreA( nullptr, 0, 0x7FFFFFFF, nullptr );
    auto acquire = [&] { WaitForSingleObject( hSem, INFINITE ); };
    auto release = [&] { ReleaseSemaphore( hSem, 1, nullptr ); };
#elif defined(__unix__)
    sem_t sem;
    sem_init( &sem, 0, 0 );
    auto acquire = [&] { while( sem_wait( &sem ) == EINTR ); };
[...]

Nice touch with the EINTR. I have seen a lot of code that misses this
aspect... Yikes!
Bonita Montero
2024-08-30 16:47:51 UTC
Permalink
Post by Chris M. Thomasson
Post by Bonita Montero
#if defined(_WIN32)
     HANDLE hSem = CreateSemaphoreA( nullptr, 0, 0x7FFFFFFF, nullptr );
     auto acquire = [&] { WaitForSingleObject( hSem, INFINITE ); };
     auto release = [&] { ReleaseSemaphore( hSem, 1, nullptr ); };
#elif defined(__unix__)
     sem_t sem;
     sem_init( &sem, 0, 0 );
     auto acquire = [&] { while( sem_wait( &sem ) == EINTR ); };
[...]
Nice touch with the EINTR. I have seen a lot of code that misses this
aspect... Yikes!
Signals are really sick.
Chris M. Thomasson
2024-08-30 19:43:06 UTC
Permalink
Post by Bonita Montero
Post by Chris M. Thomasson
Post by Bonita Montero
#if defined(_WIN32)
     HANDLE hSem = CreateSemaphoreA( nullptr, 0, 0x7FFFFFFF, nullptr );
     auto acquire = [&] { WaitForSingleObject( hSem, INFINITE ); };
     auto release = [&] { ReleaseSemaphore( hSem, 1, nullptr ); };
#elif defined(__unix__)
     sem_t sem;
     sem_init( &sem, 0, 0 );
     auto acquire = [&] { while( sem_wait( &sem ) == EINTR ); };
[...]
Nice touch with the EINTR. I have seen a lot of code that misses this
aspect... Yikes!
Signals are really sick.
Well, at least your code handles the condition. :^)
Chris M. Thomasson
2024-08-29 20:14:21 UTC
Permalink
[...]
Post by Bonita Montero
Interesting, that Linux is much less performant here, maybe it's
because WSL.
I don't know. The actual futex impl needs to be examined?
Chris M. Thomasson
2024-08-29 19:25:01 UTC
Permalink
Post by Chris M. Thomasson
Post by Chris M. Thomasson
Post by Bonita Montero
Post by Chris M. Thomasson
Post by Bonita Montero
I tested the operating-system specific mutex (CRITICAL_SECTION Or
pthread_mutext_t) against a futex and a std::mutex. I guessed std::mutex
uses th operating system specific mutex internally, but the times varied
so much across Windows and Linux that I gues that std::mutex used at
least a differently parametrized operating system mutex or maybe even
completely own code.
[...]
Post by Bonita Montero
             while( futex.exchange( true, memory_order_acquire ) )
                 futex.wait( true, memory_order_relaxed );
             futex.exchange( false, memory_order_release );
             futex.notify_one();
         } );
[...]
A wait bit would help out here... ;^) Afaict, this is a rather
"naive" use of futexes. In you use case here, the exchange to
unlock can be a simple atomic store with release semantics. Also,
try to think about calling notify_* only when you absolutely need
to... :^)
Show me your code ...
I did a while back for a futex mutex, win32 impl iirc... Do I have to
find the older code?
https://groups.google.com/g/comp.lang.c++/c/1MZvhswJ6DQ/m/qyaYH-i0CgAJ
              os-mutex  C++-futex   CT-Futex std::mutex
          1      4.458      5.288      2.932      6.213
[...]
         23    1016.16    2828.65    905.289    2798.37
         24    1096.64    2919.04    946.814    2914.23
         25    1161.21    3052.88    996.146       3093
         26    1256.53    3094.02    1034.94    3220.38
         27    1379.79    3323.04    1079.41    3390.07
         28    1446.68    3533.71    1106.27    3516.21
         29    1514.12     3618.5       1160     3624.3
         30    1601.25     3761.2    1218.03     3765.6
         31    1673.59    3913.48    1259.65    3947.46
         32    1827.81    4134.67    1319.16    4061.73
The futex mutex seems to scale a little better...? 1319.16 v 1827.81 for
os mutex?
Chris M. Thomasson
2024-08-29 19:27:53 UTC
Permalink
Post by Chris M. Thomasson
Post by Chris M. Thomasson
Post by Chris M. Thomasson
Post by Bonita Montero
Post by Chris M. Thomasson
Post by Bonita Montero
I tested the operating-system specific mutex (CRITICAL_SECTION Or
pthread_mutext_t) against a futex and a std::mutex. I guessed std::mutex
uses th operating system specific mutex internally, but the times varied
so much across Windows and Linux that I gues that std::mutex used at
least a differently parametrized operating system mutex or maybe even
completely own code.
[...]
Post by Bonita Montero
             while( futex.exchange( true, memory_order_acquire ) )
                 futex.wait( true, memory_order_relaxed );
             futex.exchange( false, memory_order_release );
             futex.notify_one();
         } );
[...]
A wait bit would help out here... ;^) Afaict, this is a rather
"naive" use of futexes. In you use case here, the exchange to
unlock can be a simple atomic store with release semantics. Also,
try to think about calling notify_* only when you absolutely need
to... :^)
Show me your code ...
I did a while back for a futex mutex, win32 impl iirc... Do I have
to find the older code?
https://groups.google.com/g/comp.lang.c++/c/1MZvhswJ6DQ/m/qyaYH-i0CgAJ
               os-mutex  C++-futex   CT-Futex std::mutex
           1      4.458      5.288      2.932      6.213
[...]
          23    1016.16    2828.65    905.289    2798.37
          24    1096.64    2919.04    946.814    2914.23
          25    1161.21    3052.88    996.146       3093
          26    1256.53    3094.02    1034.94    3220.38
          27    1379.79    3323.04    1079.41    3390.07
          28    1446.68    3533.71    1106.27    3516.21
          29    1514.12     3618.5       1160     3624.3
          30    1601.25     3761.2    1218.03     3765.6
          31    1673.59    3913.48    1259.65    3947.46
          32    1827.81    4134.67    1319.16    4061.73
The futex mutex seems to scale a little better...? 1319.16 v 1827.81 for
os mutex?
32 means 32 threads, right wrt hc in your code?
Bonita Montero
2024-08-29 19:45:15 UTC
Permalink
Post by Chris M. Thomasson
Post by Chris M. Thomasson
Post by Chris M. Thomasson
Post by Chris M. Thomasson
Post by Bonita Montero
Post by Chris M. Thomasson
Post by Bonita Montero
I tested the operating-system specific mutex (CRITICAL_SECTION Or
pthread_mutext_t) against a futex and a std::mutex. I guessed std::mutex
uses th operating system specific mutex internally, but the times varied
so much across Windows and Linux that I gues that std::mutex used at
least a differently parametrized operating system mutex or maybe even
completely own code.
[...]
Post by Bonita Montero
             while( futex.exchange( true, memory_order_acquire ) )
                 futex.wait( true, memory_order_relaxed );
             futex.exchange( false, memory_order_release );
             futex.notify_one();
         } );
[...]
A wait bit would help out here... ;^) Afaict, this is a rather
"naive" use of futexes. In you use case here, the exchange to
unlock can be a simple atomic store with release semantics. Also,
try to think about calling notify_* only when you absolutely need
to... :^)
Show me your code ...
I did a while back for a futex mutex, win32 impl iirc... Do I have
to find the older code?
https://groups.google.com/g/comp.lang.c++/c/1MZvhswJ6DQ/m/qyaYH-i0CgAJ
               os-mutex  C++-futex   CT-Futex std::mutex
           1      4.458      5.288      2.932      6.213
[...]
          23    1016.16    2828.65    905.289    2798.37
          24    1096.64    2919.04    946.814    2914.23
          25    1161.21    3052.88    996.146       3093
          26    1256.53    3094.02    1034.94    3220.38
          27    1379.79    3323.04    1079.41    3390.07
          28    1446.68    3533.71    1106.27    3516.21
          29    1514.12     3618.5       1160     3624.3
          30    1601.25     3761.2    1218.03     3765.6
          31    1673.59    3913.48    1259.65    3947.46
          32    1827.81    4134.67    1319.16    4061.73
The futex mutex seems to scale a little better...? 1319.16 v 1827.81
for os mutex?
32 means 32 threads, right wrt hc in your code?
Yes, AMD 7950X 16-core CPU.
Chris M. Thomasson
2024-08-30 22:35:19 UTC
Permalink
Post by Bonita Montero
Post by Chris M. Thomasson
Post by Chris M. Thomasson
Post by Chris M. Thomasson
Post by Chris M. Thomasson
Post by Bonita Montero
Post by Chris M. Thomasson
Post by Bonita Montero
I tested the operating-system specific mutex (CRITICAL_SECTION Or
pthread_mutext_t) against a futex and a std::mutex. I guessed std::mutex
uses th operating system specific mutex internally, but the times varied
so much across Windows and Linux that I gues that std::mutex used at
least a differently parametrized operating system mutex or maybe even
completely own code.
[...]
Post by Bonita Montero
             while( futex.exchange( true, memory_order_acquire ) )
                 futex.wait( true, memory_order_relaxed );
             futex.exchange( false, memory_order_release );
             futex.notify_one();
         } );
[...]
A wait bit would help out here... ;^) Afaict, this is a rather
"naive" use of futexes. In you use case here, the exchange to
unlock can be a simple atomic store with release semantics.
Also, try to think about calling notify_* only when you
absolutely need to... :^)
Show me your code ...
I did a while back for a futex mutex, win32 impl iirc... Do I have
to find the older code?
https://groups.google.com/g/comp.lang.c++/c/1MZvhswJ6DQ/m/qyaYH-i0CgAJ
               os-mutex  C++-futex   CT-Futex std::mutex
           1      4.458      5.288      2.932      6.213
[...]
          23    1016.16    2828.65    905.289    2798.37
          24    1096.64    2919.04    946.814    2914.23
          25    1161.21    3052.88    996.146       3093
          26    1256.53    3094.02    1034.94    3220.38
          27    1379.79    3323.04    1079.41    3390.07
          28    1446.68    3533.71    1106.27    3516.21
          29    1514.12     3618.5       1160     3624.3
          30    1601.25     3761.2    1218.03     3765.6
          31    1673.59    3913.48    1259.65    3947.46
          32    1827.81    4134.67    1319.16    4061.73
The futex mutex seems to scale a little better...? 1319.16 v 1827.81
for os mutex?
32 means 32 threads, right wrt hc in your code?
Yes, AMD 7950X 16-core CPU.
I think there is a way to make it even faster by using a CAS wrt the
logic of going from the (0,1)=>2 state. I think I remember something
like that. Faster is by further reducing calls into the slow path.

Of course the CAS would only be called on a "slow path" wrt the first
exchange failing to acquire the mutex.
Chris M. Thomasson
2024-08-29 04:20:22 UTC
Permalink
Post by Chris M. Thomasson
Post by Bonita Montero
Post by Chris M. Thomasson
Post by Bonita Montero
I tested the operating-system specific mutex (CRITICAL_SECTION Or
pthread_mutext_t) against a futex and a std::mutex. I guessed std::mutex
uses th operating system specific mutex internally, but the times varied
so much across Windows and Linux that I gues that std::mutex used at
least a differently parametrized operating system mutex or maybe even
completely own code.
[...]
Post by Bonita Montero
             while( futex.exchange( true, memory_order_acquire ) )
                 futex.wait( true, memory_order_relaxed );
             futex.exchange( false, memory_order_release );
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Can be atomic store with memory_order_release in this context of your code.
Post by Chris M. Thomasson
Post by Bonita Montero
Post by Chris M. Thomasson
Post by Bonita Montero
             futex.notify_one();
         } );
[...]
A wait bit would help out here... ;^) Afaict, this is a rather
"naive" use of futexes. In you use case here, the exchange to unlock
can be a simple atomic store with release semantics. Also, try to
think about calling notify_* only when you absolutely need to... :^)
Show me your code ...
I did a while back for a futex mutex, win32 impl iirc... Do I have to
find the older code?
jseigh
2024-08-30 13:31:20 UTC
Permalink
Post by Bonita Montero
I tested the operating-system specific mutex (CRITICAL_SECTION Or
pthread_mutext_t) against a futex and a std::mutex. I guessed std::mutex
uses th operating system specific mutex internally, but the times varied
so much across Windows and Linux that I gues that std::mutex used at
least a differently parametrized operating system mutex or maybe even
completely own code.
They don't have to be fast, they just have to allow correct synchronization
and allow performant fast paths. Can the api be improved for the latter?
Certainly yes. It is an awkward api to use.

Joe Seigh
Bonita Montero
2024-08-30 15:41:08 UTC
Permalink
Post by jseigh
They don't have to be fast, they just have to allow correct synchronization
and allow performant fast paths. ...
Futexes are there to make the slow past faster.
The fast path has been fast before.
Chris M. Thomasson
2024-08-30 19:43:39 UTC
Permalink
Post by Bonita Montero
Post by jseigh
They don't have to be fast, they just have to allow correct
synchronization
and allow performant fast paths. ...
Futexes are there to make the slow past faster.
The fast path has been fast before.
No. You are wrong here.
Bonita Montero
2024-08-30 20:03:22 UTC
Permalink
Post by Chris M. Thomasson
Post by Bonita Montero
Post by jseigh
They don't have to be fast, they just have to allow correct
synchronization
and allow performant fast paths. ...
Futexes are there to make the slow past faster.
The fast path has been fast before.
No. You are wrong here.
No, you don't need a futex for the slow path.
Chris M. Thomasson
2024-08-30 20:04:36 UTC
Permalink
Post by Bonita Montero
Post by Chris M. Thomasson
Post by Bonita Montero
Post by jseigh
They don't have to be fast, they just have to allow correct
synchronization
and allow performant fast paths. ...
Futexes are there to make the slow past faster.
The fast path has been fast before.
No. You are wrong here.
No, you don't need a futex for the slow path.
I don't know what you even mean here. Tell me the difference between a
slow path and a fast path.
Bonita Montero
2024-09-26 16:30:53 UTC
Permalink
Post by Chris M. Thomasson
Post by Bonita Montero
Post by Chris M. Thomasson
Post by Bonita Montero
Post by jseigh
They don't have to be fast, they just have to allow correct synchronization
and allow performant fast paths. ...
Futexes are there to make the slow past faster.
The fast path has been fast before.
No. You are wrong here.
No, you don't need a futex for the slow path.
I don't know what you even mean here. Tell me the difference between a
slow path and a fast path.
The fast path is in userspace, the slow path is in kernel space.
Chris M. Thomasson
2024-09-26 17:40:30 UTC
Permalink
Post by Bonita Montero
Post by Chris M. Thomasson
Post by Bonita Montero
Post by Chris M. Thomasson
Post by Bonita Montero
Post by jseigh
They don't have to be fast, they just have to allow correct synchronization
and allow performant fast paths. ...
Futexes are there to make the slow past faster.
The fast path has been fast before.
No. You are wrong here.
No, you don't need a futex for the slow path.
I don't know what you even mean here. Tell me the difference between a
slow path and a fast path.
The fast path is in userspace, the slow path is in kernel space.
Basically right. Think of hitting a slow path (before resorting to the
kernel), then trying to spin a couple of times in user space before we
have to hit a kernel call. Akin to a adaptive mutex. If we can avoid a
kernel call, well, go ahead and try a couple of times...

Actually, there is a "method" to do this even with a simple mutex via
try_lock. I wrote about it in the past. Hummm... Think of trying to give
a little back off with actual work before we resort to waiting in the
kernel. Iirc, the pattern was something like, wrt mutex:

<pseudo-code>

Keep in mind that the prefix try_* means try to no other work. Not
block! This is important...
________________________
void lock()
{
while (! try_lock())
{
if (! try_to_do_some_other_useful_work())
{
lock(); // oh shit, commit to a lock.
break;
}
}
}
________________________

Notice what's occurring here? We can use "other work" as a sort of
"natural backoff" in the contended case of a mutex in general... In a
certain sense... Instead of spinning, we are doing "actual work" if we
can and only if its available. This can work pretty good on certain
scenarios wrt how a program is structured...
Bonita Montero
2024-09-26 17:44:02 UTC
Permalink
Post by Chris M. Thomasson
Basically right. Think of hitting a slow path (before resorting to the
kernel), then trying to spin a couple of times in user space before we
have to hit a kernel call. Akin to a adaptive mutex. If we can avoid a
kernel call, well, go ahead and try a couple of times...
Adaptive mutexes often don't make any sense. The locking-frequency has
to be high and the locked-interval needs to be very short to make spin-
ning likely to succeed. That's both not often.
Post by Chris M. Thomasson
Actually, there is a "method" to do this even with a simple mutex via
try_lock. ...
I don't know what try_lock is good for. I've never seen someone using
try_lock. If you need to lock sth. you usually haven't an alternative
to locking.
Chris M. Thomasson
2024-09-26 17:48:53 UTC
Permalink
Post by Bonita Montero
Post by Chris M. Thomasson
Basically right. Think of hitting a slow path (before resorting to the
kernel), then trying to spin a couple of times in user space before we
have to hit a kernel call. Akin to a adaptive mutex. If we can avoid a
kernel call, well, go ahead and try a couple of times...
Adaptive mutexes often don't make any sense. The locking-frequency has
to be high and the locked-interval needs to be very short to make spin-
ning likely to succeed. That's both not often.
Adaptive mutexes with the right spin counts do make a difference!
Post by Bonita Montero
Post by Chris M. Thomasson
Actually, there is a "method" to do this even with a simple mutex via
try_lock. ...
I don't know what try_lock is good for. I've never seen someone using
try_lock. If you need to lock sth. you usually haven't an alternative
to locking.
I just showed you one use of try_lock. This is just one out of many.
try_lock is useful.
Bonita Montero
2024-09-26 17:54:57 UTC
Permalink
Post by Chris M. Thomasson
Post by Bonita Montero
Adaptive mutexes often don't make any sense. The locking-frequency has
to be high and the locked-interval needs to be very short to make spin-
ning likely to succeed. That's both not often.
Adaptive mutexes with the right spin counts do make a difference!
Only under the rare circumstances I've shown.
Post by Chris M. Thomasson
I just showed you one use of try_lock. This is just one out of many.
try_lock is useful.
No one needs try_lock.
Chris M. Thomasson
2024-09-26 18:09:03 UTC
Permalink
Post by Bonita Montero
Post by Chris M. Thomasson
Post by Bonita Montero
Adaptive mutexes often don't make any sense. The locking-frequency has
to be high and the locked-interval needs to be very short to make spin-
ning likely to succeed. That's both not often.
Adaptive mutexes with the right spin counts do make a difference!
Only under the rare circumstances I've shown.
Post by Chris M. Thomasson
I just showed you one use of try_lock. This is just one out of many.
try_lock is useful.
No one needs try_lock.
^^^^^^^^^^^^^^

Yawn...

Chris M. Thomasson
2024-08-30 22:22:25 UTC
Permalink
Post by Bonita Montero
Post by Chris M. Thomasson
Post by Bonita Montero
Post by jseigh
They don't have to be fast, they just have to allow correct
synchronization
and allow performant fast paths. ...
Futexes are there to make the slow past faster.
The fast path has been fast before.
No. You are wrong here.
No, you don't need a futex for the slow path.
The idea is to minimize calls into any slow path... :^)
Chris M. Thomasson
2024-08-30 19:45:01 UTC
Permalink
Post by jseigh
Post by Bonita Montero
I tested the operating-system specific mutex (CRITICAL_SECTION Or
pthread_mutext_t) against a futex and a std::mutex. I guessed std::mutex
uses th operating system specific mutex internally, but the times varied
so much across Windows and Linux that I gues that std::mutex used at
least a differently parametrized operating system mutex or maybe even
completely own code.
They don't have to be fast, they just have to allow correct synchronization
and allow performant fast paths.  Can the api be improved for the latter?
Certainly yes.  It is an awkward api to use.
100% agreed. Eventcounts are in the mix as well.
jseigh
2024-08-31 00:36:06 UTC
Permalink
...
Post by Chris M. Thomasson
Post by jseigh
They don't have to be fast, they just have to allow correct
synchronization
and allow performant fast paths.  Can the api be improved for the latter?
Certainly yes.  It is an awkward api to use.
100% agreed. Eventcounts are in the mix as well.
You might take a look at folly's eventcount. It (I think) only calls
futex wake if there are waiters, and only calls futex_wait if the
eventcount hasn't been incremented. I did my own C version since I'm
working in C for now.

Joe Seigh
Chris M. Thomasson
2024-09-02 18:56:24 UTC
Permalink
...
Post by Chris M. Thomasson
Post by jseigh
They don't have to be fast, they just have to allow correct
synchronization
and allow performant fast paths.  Can the api be improved for the latter?
Certainly yes.  It is an awkward api to use.
100% agreed. Eventcounts are in the mix as well.
You might take a look at folly's eventcount.  It (I think) only calls
futex wake if there are waiters, and only calls futex_wait if the
eventcount hasn't been incremented.  I did my own C version since I'm
working in C for now.
I need to look at it. Have been really busy lately. I think you said
that eventcounts can be a little "too" lock-free? ;^)
Chris M. Thomasson
2024-09-15 00:24:13 UTC
Permalink
...
Post by Chris M. Thomasson
Post by jseigh
They don't have to be fast, they just have to allow correct
synchronization
and allow performant fast paths.  Can the api be improved for the latter?
Certainly yes.  It is an awkward api to use.
100% agreed. Eventcounts are in the mix as well.
You might take a look at folly's eventcount.  It (I think) only calls
futex wake if there are waiters, and only calls futex_wait if the
eventcount hasn't been incremented.  I did my own C version since I'm
working in C for now.
Humm... Sounds interesting. I might have some more time tonight. Working
on an interesting collision avoidance algorithm right now using one of
my field algorithms. Remember that old eventcount algo of mine I was
tinkering with way back in c.p.t? You said the membars are no so pretty...
Loading...