Discussion:
smrproxy v2
jseigh
2024-10-17 12:10:20 UTC
I replaced the hazard pointer logic in smrproxy. It's now wait-free
instead of mostly wait-free. The reader lock logic, after loading
the address of the reader lock object into a register, is now 2
instructions: a load followed by a store. The unlock is the same
as before, just a store.

It's way faster now.

It's on the feature/003 branch as a POC. I'm working on porting
it to c++ and don't want to waste any more time on the c version.

No idea if it's a new algorithm. I suspect that since I use
the term epoch, it will be claimed that it's ebr, epoch
based reclamation, and that all ebr algorithms are equivalent.
Though I suppose you could argue it's qsbr if I point out what
the quiescent states are.
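For illustration, here's a minimal sketch of what a load-then-store reader lock like that could look like in C++. The names are hypothetical, not the actual smrproxy code, and the relaxed stores are only safe because the reclaim side issues an asymmetric memory barrier before inspecting reader slots:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical sketch, not the actual smrproxy code.
std::atomic<uint64_t> global_epoch{1};

struct reader_lock {
    std::atomic<uint64_t> epoch{0};   // 0 == not in a critical region
};

thread_local reader_lock my_lock;

inline void rl_lock() {
    // Once &my_lock is in a register this is 2 instructions:
    // load the global epoch, then store it into the reader slot.
    uint64_t e = global_epoch.load(std::memory_order_acquire);
    my_lock.epoch.store(e, std::memory_order_relaxed);
}

inline void rl_unlock() {
    // Just a store, as described above.
    my_lock.epoch.store(0, std::memory_order_release);
}
```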

Joe Seigh
Chris M. Thomasson
2024-10-17 20:10:43 UTC
Post by jseigh
[...]
It's on the feature/003 branch as a POC.
I have to take a look at it! Been really busy lately. Shit happens.
jseigh
2024-10-17 21:08:04 UTC
Post by Chris M. Thomasson
[...]
I have to take a look at it! Been really busy lately. Shit happens.
There's a quick and dirty explanation at
http://threadnought.wordpress.com/

repo at https://github.com/jseigh/smrproxy

I'll need to create some memory access diagrams that
visualize how it works at some point.

Anyway, if it's new, it's another algorithm for people to use
without attribution.

Joe Seigh
Chris M. Thomasson
2024-10-17 23:40:00 UTC
Post by jseigh
[...]
There's a quick and dirty explanation at
http://threadnought.wordpress.com/
repo at https://github.com/jseigh/smrproxy
I'll need to create some memory access diagrams that
visualize how it works at some point.
Anyway if it's new, another algorithm to use without
attribution.
Interesting. From a quick view, it kind of reminds me of a distributed
seqlock for some reason. Are you using an asymmetric membar in here,
in smr_poll?
Chris M. Thomasson
2024-10-18 00:16:55 UTC
Post by Chris M. Thomasson
[...]
Interesting. From a quick view, it kind of reminds me of a distributed
seqlock for some reason. Are you using an asymmetric membar in here,
in smr_poll?
I remember a long time ago I was messing around with a scheme where
each thread had two version counters:

pseudo code:

per_thread
{
    word m_version[2];

    word acquire()
    {
        word ver = load(global_version);
        m_version[ver % 2] = ver;
        return ver;
    }

    void release(word ver)
    {
        m_version[ver % 2] = 0;
    }
}


The global_version would only be incremented by the polling thread. This
was WAY back. I think I might have posted about it on cpt.

So, when a node was made unreachable, it would be included in the
polling logic. The poller would increment the version counter, then wait
for all the threads' prior m_versions to be zero, and collect the current
generation of objects in a defer list. Then on the next cycle it would
increment the version counter, wait until all the threads' prior versions
were zero, delete the defer list, and transfer the current gen to
the defer list.

It went something like that.
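Fleshed out, that two-slot scheme might look roughly like this in C++. This is a sketch with hypothetical names; the asymmetric barrier and the deferred-deletion bookkeeping are omitted:

```cpp
#include <atomic>
#include <cstdint>

// Sketch of the two-version-counter scheme described above; names
// are hypothetical and the polling thread's barrier is omitted.
std::atomic<uint64_t> global_version{1};   // bumped only by the poller

struct per_thread {
    std::atomic<uint64_t> m_version[2];

    per_thread() { m_version[0] = 0; m_version[1] = 0; }

    uint64_t acquire() {
        uint64_t ver = global_version.load(std::memory_order_acquire);
        m_version[ver % 2].store(ver, std::memory_order_relaxed);
        return ver;
    }

    void release(uint64_t ver) {
        m_version[ver % 2].store(0, std::memory_order_release);
    }

    // Poller side: has this thread released the generation with the
    // given parity?
    bool quiescent(uint64_t parity) const {
        return m_version[parity % 2].load(std::memory_order_acquire) == 0;
    }
};
```

The poller would bump global_version, then spin until quiescent() held for the prior parity on every registered thread before freeing that generation's defer list.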
Chris M. Thomasson
2024-10-18 00:24:08 UTC
Post by Chris M. Thomasson
[...]
It went something like that.
IIRC, I was using FlushProcessWriteBuffers back then as an asymmetric
barrier for my experiment. The polling thread would execute one after it
incremented the global version. Actually, I can't remember exactly where
I placed it, before or after. The defer list made things work.
jseigh
2024-10-18 12:07:11 UTC
Post by Chris M. Thomasson
[...]
Interesting. From a quick view, it kind of reminds me of a distributed
seqlock for some reason. Are you using an asymmetric membar in here,
in smr_poll?
Yes, linux membarrier() in smr_poll.

Not a seqlock, not least because exiting the critical region
would take 3 instructions, unless you use atomics, which are expensive
and usually have memory barriers.
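For contrast, a typical seqlock read side (a generic sketch, not smrproxy code) has to re-check the sequence number on exit and retry if a writer intervened:

```cpp
#include <atomic>

// Generic seqlock read side, for contrast with the proxy's
// store-only unlock; not smrproxy code.
std::atomic<unsigned> seq{0};   // even = no writer active
int shared_value = 42;          // the protected data

int seqlock_read() {
    unsigned s0, s1;
    int v;
    do {
        s0 = seq.load(std::memory_order_acquire);
        v = shared_value;                               // speculative read
        std::atomic_thread_fence(std::memory_order_acquire);
        s1 = seq.load(std::memory_order_relaxed);
    } while ((s0 & 1) || s0 != s1);                     // writer active or raced
    return v;
}
```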

A lot of the qsbr and ebr reader lock/unlock code is going to look
somewhat similar so you have to know how the reclaim logic uses it.
In this case I am slingshotting off of the asymmetric memory barrier.

Earlier at one point I was going to have smrproxy use hazard pointer
logic or qsbr logic as a config option, but the extra code complexity
and the fact that qsbr required 2 grace periods kind of made that
unfeasible. The qsbr logic was mostly ripped out but there were still
some pieces there.

Anyway I'm working on a c++ version, which involves a lot of extra work
besides just rewriting smrproxy. There's coming up with an api for
proxies, and testcases, which tend to be more work than the code
they're testing.

Joe Seigh
Chris M. Thomasson
2024-10-25 22:00:15 UTC
Post by jseigh
[...]
Damn! I almost missed this post! Fucking Thunderbird... Will get back to
you. Working on something else right now Joe, thanks.

https://www.facebook.com/share/p/ydGSuPLDxjkY9TAQ/
jseigh
2024-10-25 22:56:16 UTC
Post by Chris M. Thomasson
[...]
Damn! I almost missed this post! Fucking Thunderbird... Will get back to
you. Working on something else right now Joe, thanks.
https://www.facebook.com/share/p/ydGSuPLDxjkY9TAQ/
No problem. The c++ work is progressing pretty slowly, not least in
part because the documentation is not always clear as to what
something does or even what problem it is supposed to solve.
To think I took a pass on Rust because I thought it was
more complicated than it needed to be.

Joe Seigh
Chris M. Thomasson
2024-10-27 19:33:46 UTC
Post by jseigh
[...]
No problem.  The c++ work is progressing pretty slowly, not least in
part because the documentation is not always clear as to what
something does or even what problem it is supposed to solve.
To think I took a pass on Rust because I thought it was
more complicated than it needed to be.
Never even tried Rust, shit, I am behind the times. ;^)

Humm... I don't think we can get 100% C++ because of the damn asymmetric
membar for these rather "specialized" algorithms?

Is C++ thinking about creating a standard way to gain an asymmetric membar?
jseigh
2024-10-27 22:29:43 UTC
Post by Chris M. Thomasson
[...]
Never even tried Rust, shit, I am behind the times. ;^)
Humm... I don't think we can get 100% C++ because of the damn asymmetric
membar for these rather "specialized" algorithms?
Is C++ thinking about creating a standard way to gain an asymmetric membar?
I don't think so.  It's platform dependent.  Apart from Linux, mostly
it's done with a call to some virtual memory function that flushes
the TLBs (translation lookaside buffers), which involves IPI calls
to all the processors, and those have memory barriers.  This is
old; patent 3,947,823, from 1973, is cited by the patent I did.

Anyway, I version the code so there's an asymmetric memory barrier
version and an explicit memory barrier version, the latter
being much slower.
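On Linux the heavy side of such a pair can be sketched with membarrier(2). This is a hedged example, not the smrproxy source; it falls back to an ordinary seq_cst fence when the syscall is unavailable:

```cpp
#include <atomic>

#if defined(__linux__)
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>
#endif

// Heavy side of an asymmetric barrier pair, as a reclaimer like
// smr_poll might use it: force a barrier on every running thread of
// the process, so readers need only compiler-level ordering.
// Not the actual smrproxy code.
static void asymmetric_heavy_barrier() {
#if defined(__linux__)
    static const bool registered =
        syscall(SYS_membarrier,
                MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0, 0) == 0;
    if (registered &&
        syscall(SYS_membarrier,
                MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0, 0) == 0)
        return;
#endif
    // Fallback: an explicit full fence on this thread only (the slow,
    // symmetric scheme mentioned above).
    std::atomic_thread_fence(std::memory_order_seq_cst);
}
```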

Joe Seigh
Chris M. Thomasson
2024-10-27 22:32:33 UTC
Post by jseigh
[...]
Anyway, I version the code so there's an asymmetric memory barrier
version and an explicit memory barrier version, the latter
being much slower.
Ahh, nice! acquire/release, no seq_cst, right? ;^)
jseigh
2024-10-28 00:35:54 UTC
Post by Chris M. Thomasson
[...]
Ahh, nice! acquire/release, no seq_cst, right? ;^)
The membar version? That's a store/load membar so it is expensive.
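The store/load point can be made concrete: acquire/release cannot order a reader's publish store before a later load, so the explicit-membar version needs a full fence. A hypothetical sketch (names invented, not the smrproxy source):

```cpp
#include <atomic>
#include <cstdint>

// Why the explicit-membar version needs a store/load barrier: the
// reader publishes its epoch and must then re-load the global epoch
// to validate it. Hypothetical names, not the smrproxy source.
std::atomic<uint64_t> global_epoch{1};
std::atomic<uint64_t> local_epoch{0};

uint64_t lock_with_explicit_membar() {
    uint64_t e = global_epoch.load(std::memory_order_relaxed);
    local_epoch.store(e, std::memory_order_relaxed);
    // acquire/release cannot order the store above before the load
    // below; only a full (store/load) fence can, and that full fence
    // is what makes this version expensive.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return global_epoch.load(std::memory_order_relaxed);  // validate
}
```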
Chris M. Thomasson
2024-10-28 04:02:58 UTC
Post by Chris M. Thomasson
Post by Chris M. Thomasson
Post by Chris M. Thomasson
Post by jseigh
Post by Chris M. Thomasson
Post by jseigh
Post by Chris M. Thomasson
I replaced the hazard pointer logic in smrproxy.  It's now
wait- free
instead of mostly wait-free.  The reader lock logic after loading
the address of the reader lock object into a register is now 2
instructions a load followed by a store.  The unlock is same
as before, just a store.
It's way faster now.
It's on the feature/003 branch as a POC.   I'm working on porting
it to c++ and don't want to waste any more time on c version.
No idea of it's a new algorithm.  I suspect that since I use
the term epoch that it will be claimed that it's ebr, epoch
based reclamation, and that all ebr algorithms are equivalent.
Though I suppose you could argue it's qsbr if I point out what
the quiescent states are.
I have to take a look at it! Been really busy lately. Shit happens.
There's a quick and dirty explanation at
http://threadnought.wordpress.com/
repo at https://github.com/jseigh/smrproxy
I'll need to create some memory access diagrams that
visualize how it works at some point.
Anyway if it's new, another algorithm to use without
attribution.
Interesting. From a quick view, it kind of reminds me of a
distributed seqlock for some reason. Are you using an asymmetric
membar in here? in smr_poll ?
Yes, linux membarrier() in smr_poll.
Not seqlock, not least for the reason that exiting the critical region
is 3 instructions unless you use atomics which are expensive and have
memory barriers usually.
A lot of the qsbr and ebr reader lock/unlock code is going to look
somewhat similar so you have to know how the reclaim logic uses it.
In this case I am slingshotting off of the asymmetric memory barrier.
Earlier at one point I was going to have smrproxy use hazard pointer
logic or qsbr logic as a config option, but the extra code complexity
and the fact that qsbr required 2 grace periods kind of made that
unfeasible.  The qsbr logic was mostly ripped out but there were still
some pieces there.
Anyway I'm working on a c++ version, which involves a lot of extra work
besides just rewriting smrproxy.  There's coming up with an api for
proxies, and testcases, which tend to be more work than the code that
they are testing.
Damn! I almost missed this post! Fucking Thunderbird... Will get
back to you. Working on something else right now Joe, thanks.
https://www.facebook.com/share/p/ydGSuPLDxjkY9TAQ/
No problem.  The c++ work is progressing pretty slowly, not least
because the documentation is not always clear as to what
something does or even what problem it is supposed to solve.
To think I took a pass on Rust because I thought it was
more complicated than it needed to be.
Never even tried Rust, shit, I am behind the times. ;^)
Humm... I don't think we can get 100% C++ because of the damn
asymmetric membar for these rather "specialized" algorithms?
Is C++ thinking about creating a standard way to gain an asymmetric membar?
I don't think so.  It's platform dependent.  Apart from linux, mostly
it's done with a call to some virtual memory function that flushes
the TLBs (translation lookaside buffers) which involves IPI calls
to all the processors and those have memory barriers.  This is
old, 1973, patent 3,947,823 cited by the patent I did.
Anyway, I version the code so there's an asymmetric memory barrier
version and an explicit memory barrier version, the latter
being much slower.
Ahh, nice! acquire/release, no seq_cst, right? ;^)
The membar version?  That's a store/load membar so it is expensive.
I was wondering in your c++ version if you had to use any seq_cst
barriers. I think acquire/release should be good enough. Now, when I say
C++, I mean pure C++, no calls to FlushProcessWriteBuffers and things
like that.

I take it that your pure C++ version has no atomic RMW, right? Just
loads and stores?
jseigh
2024-10-28 11:45:15 UTC
Reply
Permalink
Post by Chris M. Thomasson
The membar version?  That's a store/load membar so it is expensive.
I was wondering in your c++ version if you had to use any seq_cst
barriers. I think acquire/release should be good enough. Now, when I say
C++, I mean pure C++, no calls to FlushProcessWriteBuffers and things
like that.
I take it that your pure C++ version has no atomic RMW, right? Just
loads and stores?
While a lock action has acquire memory order semantics, if the
implementation has internal stores, you have to ensure those stores
are complete before any access from the critical section.
So you may need a store/load memory barrier.

For cmpxchg, it has full seq_cst. For other rmw atomics I don't
know. I have to ask on c.a. I think some data dependency and/or
control dependency might factor in.

Joe Seigh
Chris M. Thomasson
2024-10-28 21:57:23 UTC
Reply
Permalink
Post by jseigh
Post by Chris M. Thomasson
The membar version?  That's a store/load membar so it is expensive.
I was wondering in your c++ version if you had to use any seq_cst
barriers. I think acquire/release should be good enough. Now, when I
say C++, I mean pure C++, no calls to FlushProcessWriteBuffers and
things like that.
I take it that your pure C++ version has no atomic RMW, right? Just
loads and stores?
While a lock action has acquire memory order semantics, if the
implementation has internal stores, you have to those stores
are complete before any access from the critical section.
So you may need a store/load memory barrier.
Wrt acquiring a lock, the only class of mutex logic that comes to mind
that requires an explicit storeload-style membar is Peterson's, and some
others along those lines, so to speak. This is for the store-and-load
version. Now, RMW on x86 basically implies a StoreLoad wrt the LOCK
prefix, XCHG aside, for it has an implied LOCK prefix. For instance the
original SMR algo requires a storeload as-is on x86/x64: MFENCE or LOCK
prefix.

Fwiw, my experimental pure C++ proxy works fine with XADD, or atomic
fetch-add. It needs explicit membars (though no #StoreLoad) on SPARC in
RMO mode. On x86, the LOCK prefix handles that wrt the RMW's themselves.
This is a lot different than using stores and loads. The original SMR
and Peterson's algos need that "store followed by a load to a different
location" action to hold true, aka, storeload...

Now, I don't think that a data-dependent load can act like a storeload.
I thought that they act sort of like an acquire, aka #LoadStore |
#LoadLoad wrt SPARC. SPARC in RMO mode honors data dependencies. Now,
the DEC Alpha is a different story... ;^)
Post by jseigh
For cmpxchg, it has full cst_seq.  For other rmw atomics I don't
know.  I have to ask on c.a.  I think some data dependency and/or
control dependency might factor in.
Joe Seigh
Chris M. Thomasson
2024-10-28 22:09:39 UTC
Reply
Permalink
Post by Chris M. Thomasson
Post by jseigh
Post by Chris M. Thomasson
The membar version?  That's a store/load membar so it is expensive.
I was wondering in your c++ version if you had to use any seq_cst
barriers. I think acquire/release should be good enough. Now, when I
say C++, I mean pure C++, no calls to FlushProcessWriteBuffers and
things like that.
I take it that your pure C++ version has no atomic RMW, right? Just
loads and stores?
While a lock action has acquire memory order semantics, if the
implementation has internal stores, you have to those stores
are complete before any access from the critical section.
So you may need a store/load memory barrier.
Wrt acquiring a lock the only class of mutex logic that comes to mind
that requires an explicit storeload style membar is Petersons, and some
others along those lines, so to speak. This is for the store and load
version. Now, RMW on x86 basically implies a StoreLoad wrt the LOCK
prefix, XCHG aside for it has an implied LOCK prefix. For instance the
original SMR algo requires a storeload as is on x86/x64. MFENCE or LOCK
prefix.
Fwiw, my experimental pure C++ proxy works fine with XADD, or atomic
fetch-add. It needs an explicit membars (no #StoreLoad) on SPARC in RMO
mode. On x86, the LOCK prefix handles that wrt the RMW's themselves.
The fun part is that since it's pure C++, those membars are going to be
emitted for me no matter what arch it's compiling for.
Post by Chris M. Thomasson
This is a lot different than using stores and loads. The original SMR
and Peterson's algo needs that "store followed by a load to a different
location" action to hold true, aka, storeload...
Now, I don't think that a data-dependant load can act like a storeload.
I thought that they act sort of like an acquire, aka #LoadStore |
#LoadLoad wrt SPARC. SPARC in RMO mode honors data-dependencies. Now,
the DEC Alpha is a different story... ;^)
Post by jseigh
For cmpxchg, it has full cst_seq.  For other rmw atomics I don't
know.  I have to ask on c.a.  I think some data dependency and/or
control dependency might factor in.
Joe Seigh
jseigh
2024-10-29 01:17:55 UTC
Reply
Permalink
Post by Chris M. Thomasson
Post by jseigh
Post by Chris M. Thomasson
The membar version?  That's a store/load membar so it is expensive.
I was wondering in your c++ version if you had to use any seq_cst
barriers. I think acquire/release should be good enough. Now, when I
say C++, I mean pure C++, no calls to FlushProcessWriteBuffers and
things like that.
I take it that your pure C++ version has no atomic RMW, right? Just
loads and stores?
While a lock action has acquire memory order semantics, if the
implementation has internal stores, you have to those stores
are complete before any access from the critical section.
So you may need a store/load memory barrier.
Wrt acquiring a lock the only class of mutex logic that comes to mind
that requires an explicit storeload style membar is Petersons, and some
others along those lines, so to speak. This is for the store and load
version. Now, RMW on x86 basically implies a StoreLoad wrt the LOCK
prefix, XCHG aside for it has an implied LOCK prefix. For instance the
original SMR algo requires a storeload as is on x86/x64. MFENCE or LOCK
prefix.
Fwiw, my experimental pure C++ proxy works fine with XADD, or atomic
fetch-add. It needs an explicit membars (no #StoreLoad) on SPARC in RMO
mode. On x86, the LOCK prefix handles that wrt the RMW's themselves.
This is a lot different than using stores and loads. The original SMR
and Peterson's algo needs that "store followed by a load to a different
location" action to hold true, aka, storeload...
Now, I don't think that a data-dependant load can act like a storeload.
I thought that they act sort of like an acquire, aka #LoadStore |
#LoadLoad wrt SPARC. SPARC in RMO mode honors data-dependencies. Now,
the DEC Alpha is a different story... ;^)
fwiw, here's the lock and unlock logic from smrproxy rewrite

inline void lock()
{
    epoch_t _epoch = shadow_epoch.load(std::memory_order_relaxed);
    _ref_epoch.store(_epoch, std::memory_order_relaxed);
    std::atomic_signal_fence(std::memory_order_acquire);
}

inline void unlock()
{
    _ref_epoch.store(0, std::memory_order_release);
}

epoch_t is interesting.  It's uint64_t but handles wrapped
compares, i.e. for an epoch_t x1 and uint64_t n

x1 < (x1 + n)

for any value of x1 and any value of n from 0 to 2**63;
eg.
0xfffffffffffffff0 < 0x0000000000000001


The rewrite is almost complete except for some thread_local
stuff. I think I might break off there. Most of the
additional work is writing the test code. I'm considering
rewriting it in Rust.

Joe Seigh
Chris M. Thomasson
2024-10-29 04:35:33 UTC
Reply
Permalink
Post by jseigh
Post by Chris M. Thomasson
Post by jseigh
Post by Chris M. Thomasson
The membar version?  That's a store/load membar so it is expensive.
I was wondering in your c++ version if you had to use any seq_cst
barriers. I think acquire/release should be good enough. Now, when I
say C++, I mean pure C++, no calls to FlushProcessWriteBuffers and
things like that.
I take it that your pure C++ version has no atomic RMW, right? Just
loads and stores?
While a lock action has acquire memory order semantics, if the
implementation has internal stores, you have to those stores
are complete before any access from the critical section.
So you may need a store/load memory barrier.
Wrt acquiring a lock the only class of mutex logic that comes to mind
that requires an explicit storeload style membar is Petersons, and
some others along those lines, so to speak. This is for the store and
load version. Now, RMW on x86 basically implies a StoreLoad wrt the
LOCK prefix, XCHG aside for it has an implied LOCK prefix. For
instance the original SMR algo requires a storeload as is on x86/x64.
MFENCE or LOCK prefix.
Fwiw, my experimental pure C++ proxy works fine with XADD, or atomic
fetch-add. It needs an explicit membars (no #StoreLoad) on SPARC in
RMO mode. On x86, the LOCK prefix handles that wrt the RMW's
themselves. This is a lot different than using stores and loads. The
original SMR and Peterson's algo needs that "store followed by a load
to a different location" action to hold true, aka, storeload...
Now, I don't think that a data-dependant load can act like a
storeload. I thought that they act sort of like an acquire, aka
#LoadStore | #LoadLoad wrt SPARC. SPARC in RMO mode honors data-
dependencies. Now, the DEC Alpha is a different story... ;^)
fwiw, here's the lock and unlock logic from smrproxy rewrite
    inline void lock()
    {
        epoch_t _epoch = shadow_epoch.load(std::memory_order_relaxed);
        _ref_epoch.store(_epoch, std::memory_order_relaxed);
        std::atomic_signal_fence(std::memory_order_acquire);
^^^^^^^^^^^^^^^^^^^^^^
Post by jseigh
    }
Still don't know how your pure C++ write up can handle this without an
std::atomic_thread_fence(std::memory_order_acquire).
Post by jseigh
    inline void unlock()
    {
        _ref_epoch.store(0, std::memory_order_release);
    }
epoch_t is interesting.  It's uint64_t but handles wrapped
compares, ie. for an epoch_t x1 and uint64_t n
    x1 < (x1 + n)
for any value of x1 and any value of n from 0 to 2**63;
eg.
   0xfffffffffffffff0 < 0x0000000000000001
For some reason it reminds me of your atomic 63 bit counter from a while
back. If your smr_poll thread is the only thread that can increment the
epoch, then it will be able to detect rolling to zero, right? For one of
my older proxys the thread would have two version counters:

per_thread
{
    word m_ver[2];

    lock()
    {
        word ver = g_version;
        m_ver[ver % 2] = ver;
    }

    unlock(word ver)
    {
        m_ver[ver % 2] = 0;
    }
}

So, it's different logic. The only way I got around using memory
barriers (compiler barriers aside for a moment), here was for the
polling thread to use an asymmetric membar. For a pure C++ version, I
think it would require:


per_thread
{
    word m_ver[2];

    lock()
    {
        word ver = g_version;
        m_ver[ver % 2] = ver;
        #LoadStore | #LoadLoad;
    }

    unlock(word ver)
    {
        #LoadStore | #StoreStore;
        m_ver[ver % 2] = 0;
    }
}

I think.... Humm...
Post by jseigh
The rewrite is almost complete except for some thread_local
stuff.  I think I might break off there.  Most of the
additional work is writing the test code.  I'm considering
rewriting it in Rust.
Joe Seigh
jseigh
2024-10-29 11:23:56 UTC
Reply
Permalink
Post by Chris M. Thomasson
Post by jseigh
fwiw, here's the lock and unlock logic from smrproxy rewrite
     inline void lock()
     {
         epoch_t _epoch = shadow_epoch.load(std::memory_order_relaxed);
         _ref_epoch.store(_epoch, std::memory_order_relaxed);
         std::atomic_signal_fence(std::memory_order_acquire);
^^^^^^^^^^^^^^^^^^^^^^
Post by jseigh
     }
Still don't know how your pure C++ write up can handle this without an
std::atomic_thread_fence(std::memory_order_acquire).
No thread fence is necessary. The loads can move before
the store. They just can't move before the async
membar. After that membar any previously retired
objects are no longer reachable.
Post by Chris M. Thomasson
Post by jseigh
     inline void unlock()
     {
         _ref_epoch.store(0, std::memory_order_release);
     }
Chris M. Thomasson
2024-10-29 21:51:58 UTC
Reply
Permalink
Post by Chris M. Thomasson
Post by jseigh
fwiw, here's the lock and unlock logic from smrproxy rewrite
     inline void lock()
     {
         epoch_t _epoch = shadow_epoch.load(std::memory_order_relaxed);
         _ref_epoch.store(_epoch, std::memory_order_relaxed);
         std::atomic_signal_fence(std::memory_order_acquire);
^^^^^^^^^^^^^^^^^^^^^^
Post by jseigh
     }
Still don't know how your pure C++ write up can handle this without an
std::atomic_thread_fence(std::memory_order_acquire).
No thread fence is necessary.  The loads can move before
the store.  They just can't move before the async
membar.  After that membar any previously retired
objects are no longer reachable.
Ahhhh. I thought your C++ version is going to be one that did not use an
asymmetric membar in order to make it so called "100%" pure... Not sure
about C++ including one in the actual standard. If they did I think it
would have some info about it, perhaps akin to:

https://en.cppreference.com/w/cpp/atomic/atomic/is_lock_free

Well, even then, if an asymmetric membar was not available on the arch
the code is being compiled for, it can say, well, shit happens! Then
allow you to fall back to another version...?

Also, the C++ wording of this is interesting:

https://en.cppreference.com/w/cpp/atomic/atomic/is_always_lock_free

wrt:
_____________
​0​ for the built-in atomic types that are never lock-free,
1 for the built-in atomic types that are sometimes lock-free,
2 for the built-in atomic types that are always lock-free.
_____________

That's fun... ;^)
Post by Chris M. Thomasson
Post by jseigh
     inline void unlock()
     {
         _ref_epoch.store(0, std::memory_order_release);
     }
Chris M. Thomasson
2024-10-29 04:38:09 UTC
Reply
Permalink
Post by jseigh
Post by Chris M. Thomasson
Post by jseigh
Post by Chris M. Thomasson
The membar version?  That's a store/load membar so it is expensive.
I was wondering in your c++ version if you had to use any seq_cst
barriers. I think acquire/release should be good enough. Now, when I
say C++, I mean pure C++, no calls to FlushProcessWriteBuffers and
things like that.
I take it that your pure C++ version has no atomic RMW, right? Just
loads and stores?
While a lock action has acquire memory order semantics, if the
implementation has internal stores, you have to those stores
are complete before any access from the critical section.
So you may need a store/load memory barrier.
Wrt acquiring a lock the only class of mutex logic that comes to mind
that requires an explicit storeload style membar is Petersons, and
some others along those lines, so to speak. This is for the store and
load version. Now, RMW on x86 basically implies a StoreLoad wrt the
LOCK prefix, XCHG aside for it has an implied LOCK prefix. For
instance the original SMR algo requires a storeload as is on x86/x64.
MFENCE or LOCK prefix.
Fwiw, my experimental pure C++ proxy works fine with XADD, or atomic
fetch-add. It needs an explicit membars (no #StoreLoad) on SPARC in
RMO mode. On x86, the LOCK prefix handles that wrt the RMW's
themselves. This is a lot different than using stores and loads. The
original SMR and Peterson's algo needs that "store followed by a load
to a different location" action to hold true, aka, storeload...
Now, I don't think that a data-dependant load can act like a
storeload. I thought that they act sort of like an acquire, aka
#LoadStore | #LoadLoad wrt SPARC. SPARC in RMO mode honors data-
dependencies. Now, the DEC Alpha is a different story... ;^)
fwiw, here's the lock and unlock logic from smrproxy rewrite
    inline void lock()
    {
        epoch_t _epoch = shadow_epoch.load(std::memory_order_relaxed);
        _ref_epoch.store(_epoch, std::memory_order_relaxed);
        std::atomic_signal_fence(std::memory_order_acquire);
    }
    inline void unlock()
    {
        _ref_epoch.store(0, std::memory_order_release);
    }
epoch_t is interesting.  It's uint64_t but handles wrapped
compares, ie. for an epoch_t x1 and uint64_t n
Only your single polling thread can mutate the shadow_epoch, right?
Post by jseigh
    x1 < (x1 + n)
for any value of x1 and any value of n from 0 to 2**63;
eg.
   0xfffffffffffffff0 < 0x0000000000000001
The rewrite is almost complete except for some thread_local
stuff.  I think I might break off there.  Most of the
additional work is writing the test code.  I'm considering
rewriting it in Rust.
jseigh
2024-10-29 11:27:40 UTC
Reply
Permalink
Post by Chris M. Thomasson
Post by jseigh
fwiw, here's the lock and unlock logic from smrproxy rewrite
     inline void lock()
     {
         epoch_t _epoch = shadow_epoch.load(std::memory_order_relaxed);
         _ref_epoch.store(_epoch, std::memory_order_relaxed);
         std::atomic_signal_fence(std::memory_order_acquire);
     }
     inline void unlock()
     {
         _ref_epoch.store(0, std::memory_order_release);
     }
epoch_t is interesting.  It's uint64_t but handles wrapped
compares, ie. for an epoch_t x1 and uint64_t n
Only your single polling thread can mutate the shadow_epoch, right?
Yes. It's just an optimization. The reader threads could read
from the global epoch but it would be in a separate cache line
and be an extra dependent load. So one dependent load and
same cache line.
Chris M. Thomasson
2024-10-29 21:56:16 UTC
Reply
Permalink
Post by Chris M. Thomasson
Post by jseigh
fwiw, here's the lock and unlock logic from smrproxy rewrite
     inline void lock()
     {
         epoch_t _epoch = shadow_epoch.load(std::memory_order_relaxed);
         _ref_epoch.store(_epoch, std::memory_order_relaxed);
         std::atomic_signal_fence(std::memory_order_acquire);
     }
     inline void unlock()
     {
         _ref_epoch.store(0, std::memory_order_release);
     }
epoch_t is interesting.  It's uint64_t but handles wrapped
compares, ie. for an epoch_t x1 and uint64_t n
Only your single polling thread can mutate the shadow_epoch, right?
Yes.  It's just an optimization. The reader threads could read
from the global epoch but it would be in a separate cache line
and be an extra dependent load.  So one dependent load and
same cache line.
Are you taking advantage of the fancy alignment capabilities of C++?

https://en.cppreference.com/w/cpp/language/alignas

and friends? They seem to work fine wrt the last time I checked them.

It's nice to have a standard way to pad and align on cache line
boundaries. :^)
jseigh
2024-10-30 16:36:45 UTC
Reply
Permalink
Post by Chris M. Thomasson
Yes.  It's just an optimization. The reader threads could read
from the global epoch but it would be in a separate cache line
and be an extra dependent load.  So one dependent load and
same cache line.
Are you taking advantage of the fancy alignment capabilities of C++?
https://en.cppreference.com/w/cpp/language/alignas
and friends? They seem to work fine wrt the last time I checked them.
It's nice to have a standard way to pad and align on cache line
boundaries. :^)
It's target processor dependent.  You need to query the cache line size
and pass it to the compile as a define variable.

There's supposed to be a built-in define that would be target-system
dependent, but it's not implemented.
Chris M. Thomasson
2024-11-01 19:20:39 UTC
Reply
Permalink
Post by Chris M. Thomasson
Yes.  It's just an optimization. The reader threads could read
from the global epoch but it would be in a separate cache line
and be an extra dependent load.  So one dependent load and
same cache line.
Are you taking advantage of the fancy alignment capabilities of C++?
https://en.cppreference.com/w/cpp/language/alignas
and friends? They seem to work fine wrt the last time I checked them.
It's nice to have a standard way to pad and align on cache line
boundaries. :^)
It's target processor dependent.  You need to query the cache line size
and pass to compile as a define variable.
There's supposed to be built in define that would be target system
dependent but it's not implemented.
What about:

https://en.cppreference.com/w/cpp/thread/hardware_destructive_interference_size

?
Chris M. Thomasson
2024-10-29 04:41:12 UTC
Reply
Permalink
Post by jseigh
Post by Chris M. Thomasson
Post by jseigh
Post by Chris M. Thomasson
The membar version?  That's a store/load membar so it is expensive.
I was wondering in your c++ version if you had to use any seq_cst
barriers. I think acquire/release should be good enough. Now, when I
say C++, I mean pure C++, no calls to FlushProcessWriteBuffers and
things like that.
I take it that your pure C++ version has no atomic RMW, right? Just
loads and stores?
While a lock action has acquire memory order semantics, if the
implementation has internal stores, you have to those stores
are complete before any access from the critical section.
So you may need a store/load memory barrier.
Wrt acquiring a lock the only class of mutex logic that comes to mind
that requires an explicit storeload style membar is Petersons, and
some others along those lines, so to speak. This is for the store and
load version. Now, RMW on x86 basically implies a StoreLoad wrt the
LOCK prefix, XCHG aside for it has an implied LOCK prefix. For
instance the original SMR algo requires a storeload as is on x86/x64.
MFENCE or LOCK prefix.
Fwiw, my experimental pure C++ proxy works fine with XADD, or atomic
fetch-add. It needs an explicit membars (no #StoreLoad) on SPARC in
RMO mode. On x86, the LOCK prefix handles that wrt the RMW's
themselves. This is a lot different than using stores and loads. The
original SMR and Peterson's algo needs that "store followed by a load
to a different location" action to hold true, aka, storeload...
Now, I don't think that a data-dependant load can act like a
storeload. I thought that they act sort of like an acquire, aka
#LoadStore | #LoadLoad wrt SPARC. SPARC in RMO mode honors data-
dependencies. Now, the DEC Alpha is a different story... ;^)
fwiw, here's the lock and unlock logic from smrproxy rewrite
    inline void lock()
    {
        epoch_t _epoch = shadow_epoch.load(std::memory_order_relaxed);
        _ref_epoch.store(_epoch, std::memory_order_relaxed);
        std::atomic_signal_fence(std::memory_order_acquire);
    }
    inline void unlock()
    {
        _ref_epoch.store(0, std::memory_order_release);
    }
epoch_t is interesting.  It's uint64_t but handles wrapped
compares, ie. for an epoch_t x1 and uint64_t n
[...]

Humm... I am not sure if it would work with just the release. The
polling thread would read from these per thread epochs, _ref_epoch,
using an acquire barrier? Still. Not sure if that would work. Need to
put my thinking cap on. ;^)
Chris M. Thomasson
2024-10-29 22:05:47 UTC
Reply
Permalink
Post by Chris M. Thomasson
Post by jseigh
Post by Chris M. Thomasson
Post by jseigh
Post by Chris M. Thomasson
The membar version?  That's a store/load membar so it is expensive.
I was wondering in your c++ version if you had to use any seq_cst
barriers. I think acquire/release should be good enough. Now, when
I say C++, I mean pure C++, no calls to FlushProcessWriteBuffers
and things like that.
I take it that your pure C++ version has no atomic RMW, right? Just
loads and stores?
While a lock action has acquire memory order semantics, if the
implementation has internal stores, you have to those stores
are complete before any access from the critical section.
So you may need a store/load memory barrier.
Wrt acquiring a lock the only class of mutex logic that comes to mind
that requires an explicit storeload style membar is Petersons, and
some others along those lines, so to speak. This is for the store and
load version. Now, RMW on x86 basically implies a StoreLoad wrt the
LOCK prefix, XCHG aside for it has an implied LOCK prefix. For
instance the original SMR algo requires a storeload as is on x86/x64.
MFENCE or LOCK prefix.
Fwiw, my experimental pure C++ proxy works fine with XADD, or atomic
fetch-add. It needs an explicit membars (no #StoreLoad) on SPARC in
RMO mode. On x86, the LOCK prefix handles that wrt the RMW's
themselves. This is a lot different than using stores and loads. The
original SMR and Peterson's algo needs that "store followed by a load
to a different location" action to hold true, aka, storeload...
Now, I don't think that a data-dependant load can act like a
storeload. I thought that they act sort of like an acquire, aka
#LoadStore | #LoadLoad wrt SPARC. SPARC in RMO mode honors data-
dependencies. Now, the DEC Alpha is a different story... ;^)
fwiw, here's the lock and unlock logic from smrproxy rewrite
     inline void lock()
     {
         epoch_t _epoch = shadow_epoch.load(std::memory_order_relaxed);
         _ref_epoch.store(_epoch, std::memory_order_relaxed);
         std::atomic_signal_fence(std::memory_order_acquire);
     }
     inline void unlock()
     {
         _ref_epoch.store(0, std::memory_order_release);
     }
epoch_t is interesting.  It's uint64_t but handles wrapped
compares, ie. for an epoch_t x1 and uint64_t n
[...]
Humm... I am not sure if it would work with just the release. The
polling thread would read from these per thread epochs, _ref_epoch,
using an acquire barrier? Still. Not sure if that would work. Need to
put my thinking cap on. ;^)
Ahhh, if you are using an async membar in your upcoming C++ version,
then it would be fine. No problem. A compiler fence a la
atomic_signal_fence, and the explicit release, well, it will work. I
don't see why it would not work.

For some reason, I thought you were going to not use an async membar in
your C++ version. Sorry. However, it still would be fun to test
against... ;^)
jseigh
2024-10-30 16:39:38 UTC
Reply
Permalink
Post by Chris M. Thomasson
Ahhh, if you are using an async membar in your upcoming C++ version,
then it would be fine. No problem. A compiler fence ala
atomic_signal_fence, and the the explicit release, well, it will work. I
don't see why it would not work.
For some reason, I thought you were going to not use an async membar in
your C++ version. Sorry. However, it still would be fun to test
against... ;^)
The C version has both versions.  The C++ version does only the
async membar version.  But I'm not publishing that code so it's
a moot point.
Chris M. Thomasson
2024-11-04 05:14:37 UTC
Reply
Permalink
Post by Chris M. Thomasson
Ahhh, if you are using an async membar in your upcoming C++ version,
then it would be fine. No problem. A compiler fence ala
atomic_signal_fence, and the the explicit release, well, it will work.
I don't see why it would not work.
For some reason, I thought you were going to not use an async membar
in your C++ version. Sorry. However, it still would be fun to test
against... ;^)
The C version has both versions.  The C++ version does only the
async member version.  But I'm not publishing that code so it's
a moot point.
I got side tracked with more heavy math. The problem with C++ code that
uses an async memory barrier is that it's automatically rendered into a
non-portable state... Yikes! Imvvvvvho, C/C++ should think about
including them in some future standard. It would be nice. Well, for us
at least! ;^)
jseigh
2024-11-04 12:46:37 UTC
Reply
Permalink
Post by Chris M. Thomasson
Post by Chris M. Thomasson
Ahhh, if you are using an async membar in your upcoming C++ version,
then it would be fine. No problem. A compiler fence ala
atomic_signal_fence, and the the explicit release, well, it will
work. I don't see why it would not work.
For some reason, I thought you were going to not use an async membar
in your C++ version. Sorry. However, it still would be fun to test
against... ;^)
The C version has both versions.  The C++ version does only the
async member version.  But I'm not publishing that code so it's
a moot point.
I got side tracked with more heavy math. The problem with C++ code that
uses an async memory barrier is that its automatically rendered into a
non-portable state... Yikes! Imvvvvvho, C/C++ should think about
including them in some future standard. It would be nice. Well, for us
at least! ;^)
That's never going to happen.  DWCAS has been around for more than
50 years and c++ doesn't support it and probably never will.
You can't write lock-free queues that are ABA-free and
performant without it.  So async memory barriers won't
happen any time soon either.

Long term I think c++ will fade into irrelevance along with
all the other programming languages based on an imperfect
knowledge of concurrency, which is basically all of them
right now.
M***@DastartdlyHQ.org
2024-11-04 14:09:14 UTC
Reply
Permalink
On Mon, 4 Nov 2024 07:46:37 -0500
Post by jseigh
Post by Chris M. Thomasson
Post by Chris M. Thomasson
Ahhh, if you are using an async membar in your upcoming C++ version,
then it would be fine. No problem. A compiler fence ala
atomic_signal_fence, and the the explicit release, well, it will
work. I don't see why it would not work.
For some reason, I thought you were going to not use an async membar
in your C++ version. Sorry. However, it still would be fun to test
against... ;^)
The C version has both versions.  The C++ version does only the
async member version.  But I'm not publishing that code so it's
a moot point.
I got side tracked with more heavy math. The problem with C++ code that
uses an async memory barrier is that its automatically rendered into a
non-portable state... Yikes! Imvvvvvho, C/C++ should think about
including them in some future standard. It would be nice. Well, for us
at least! ;^)
That's never going to happen. DWCAS has been around for more than
50 years and c++ doesn't support that and probably never will.
You can't write lock-free queues that are ABA free and
are performant without that. So async memory barriers won't
happen any time soon either.
Long term I think c++ will fade into irrelevance along with
all the other programming languages based on an imperfect
knowledge of concurrency, which is basically all of them
right now.
Given that most concurrent operating systems are written in these "imperfect"
languages, how does that square with your definition? And how would your
perfect language run on them?

Anyway, concurrency is the job of the OS, not the language. C++ threading is
just a wrapper around pthreads on *nix and Windows threads on Windows. The
language just needs an interface to the underlying OS functionality; it should
not try to implement the functionality itself, as it'll always be a hack.
Chris M. Thomasson
2024-11-04 20:42:53 UTC
Reply
Permalink
Post by M***@DastartdlyHQ.org
On Mon, 4 Nov 2024 07:46:37 -0500
Post by jseigh
Post by Chris M. Thomasson
Post by Chris M. Thomasson
Ahhh, if you are using an async membar in your upcoming C++ version,
then it would be fine. No problem. A compiler fence ala
atomic_signal_fence, and the the explicit release, well, it will
work. I don't see why it would not work.
For some reason, I thought you were going to not use an async membar
in your C++ version. Sorry. However, it still would be fun to test
against... ;^)
The C version has both versions.  The C++ version does only the
async member version.  But I'm not publishing that code so it's
a moot point.
I got side tracked with more heavy math. The problem with C++ code that
uses an async memory barrier is that its automatically rendered into a
non-portable state... Yikes! Imvvvvvho, C/C++ should think about
including them in some future standard. It would be nice. Well, for us
at least! ;^)
That's never going to happen. DWCAS has been around for more than
50 years and c++ doesn't support that and probably never will.
You can't write lock-free queues that are ABA free and
are performant without that. So async memory barriers won't
happen any time soon either.
Long term I think c++ will fade into irrelevance along with
all the other programming languages based on an imperfect
knowledge of concurrency, which is basically all of them
right now.
Given most concurrent operating systems are written in these "imperfect"
languages how does that square with your definition? And how would your
perfect language run on them?
Anyway, concurrency is the job of the OS, not the language. C++ threading is
just a wrapper around pthreads on *nix and windows threads on Windows. The
language just needs an interface to the underlying OS functionality, it should
not try and implement the functionality itself as it'll always be a hack.
A start would be C++ having an "always lock free" CAS for two contiguous
words on systems that support it, yes, even 64 bit. ala:

struct anchor {
word a;
word b;
};

cmpxchg8b for x86, cmpxchg16b for x64, etc...

https://www.felixcloutier.com/x86/cmpxchg8b:cmpxchg16b
Chris M. Thomasson
2024-11-09 21:51:26 UTC
Reply
Permalink
Post by Chris M. Thomasson
Post by M***@DastartdlyHQ.org
On Mon, 4 Nov 2024 07:46:37 -0500
Post by Chris M. Thomasson
Post by Chris M. Thomasson
Ahhh, if you are using an async membar in your upcoming C++ version,
then it would be fine. No problem. A compiler fence ala
atomic_signal_fence, and the the explicit release, well, it will
work. I don't see why it would not work.
For some reason, I thought you were going to not use an async membar
in your C++ version. Sorry. However, it still would be fun to test
against... ;^)
The C version has both versions.  The C++ version does only the
async member version.  But I'm not publishing that code so it's
a moot point.
I got side tracked with more heavy math. The problem with C++ code that
uses an async memory barrier is that its automatically rendered into a
non-portable state... Yikes! Imvvvvvho, C/C++ should think about
including them in some future standard. It would be nice. Well, for us
at least! ;^)
That's never going to happen.  DWCAS has been around for more than
50 years and c++ doesn't support that and probably never will.
You can't write lock-free queues that are ABA free and
are performant without that.  So async memory barriers won't
happen any time soon either.
Long term I think c++ will fade into irrelevance along with
all the other programming languages based on an imperfect
knowledge of concurrency, which is basically all of them
right now.
Given most concurrent operating systems are written in these "imperfect"
languages how does that square with your definition? And how would your
perfect language run on them?
Anyway, concurrency is the job of the OS, not the language. C++ threading is
just a wrapper around pthreads on *nix and windows threads on Windows. The
language just needs an interface to the underlying OS functionality, it should
not try and implement the functionality itself as it'll always be a hack.
A start would be C++ having an "always lock free" CAS for two contiguous
struct anchor {
    word a;
    word b;
};
Even better:

struct anchor {
void* a;
word b;
};

where sizeof(void*) = sizeof(word)... ;^)
Post by Chris M. Thomasson
cmpxchg8b for x86, cmpxchg16b for x64, ect...
https://www.felixcloutier.com/x86/cmpxchg8b:cmpxchg16b
Chris M. Thomasson
2024-12-12 08:43:23 UTC
Reply
Permalink
Post by Chris M. Thomasson
Post by Chris M. Thomasson
Ahhh, if you are using an async membar in your upcoming C++ version,
then it would be fine. No problem. A compiler fence ala
atomic_signal_fence, and the the explicit release, well, it will
work. I don't see why it would not work.
For some reason, I thought you were going to not use an async membar
in your C++ version. Sorry. However, it still would be fun to test
against... ;^)
The C version has both versions.  The C++ version does only the
async member version.  But I'm not publishing that code so it's
a moot point.
I got side tracked with more heavy math. The problem with C++ code
that uses an async memory barrier is that its automatically rendered
into a non-portable state... Yikes! Imvvvvvho, C/C++ should think
about including them in some future standard. It would be nice. Well,
for us at least! ;^)
That's never going to happen.  DWCAS has been around for more than
50 years and c++ doesn't support that and probably never will.
You can't write lock-free queues that are ABA free and
are performant without that.
Hummm... If I remember correctly, you said something about using a
simple atomic exchange to pop a whole list (lock-free stack), then
simple reversing the list to get a fifo order? Do you remember any of
that way back on c.p.t?
So async memory barriers won't
happen any time soon either.
Long term I think c++ will fade into irrelevance along with
all the other programming languages based on an imperfect
knowledge of concurrency, which is basically all of them
right now.
jseigh
2024-12-12 12:29:11 UTC
Reply
Permalink
Post by Chris M. Thomasson
Hummm... If I remember correctly, you said something about using a
simple atomic exchange to pop a whole list (lock-free stack), then
simple reversing the list to get a fifo order? Do you remember any of
that way back on c.p.t?
That kind of stuff pre-dates c.p.t. even.
Chris M. Thomasson
2024-12-12 21:55:38 UTC
Reply
Permalink
Post by jseigh
Post by Chris M. Thomasson
Hummm... If I remember correctly, you said something about using a
simple atomic exchange to pop a whole list (lock-free stack), then
simple reversing the list to get a fifo order? Do you remember any of
that way back on c.p.t?
That kind of stuff pre-dates c.p.t. even.
Has to. Although, was it not in the IBM Principles of Operation where
they describe their lock-free stack in an appendix, iirc... I cannot
remember the exact one right now. Iirc, it was under free pool
manipulation?
jseigh
2024-12-13 00:48:12 UTC
Reply
Permalink
Post by Chris M. Thomasson
Post by jseigh
Post by Chris M. Thomasson
Hummm... If I remember correctly, you said something about using a
simple atomic exchange to pop a whole list (lock-free stack), then
simple reversing the list to get a fifo order? Do you remember any of
that way back on c.p.t?
That kind of stuff pre-dates c.p.t. even.
Has to. Although, it was not in the IBM principles of operation where
the describe their lock-free stack in an appendix, iirc... I cannot
remember the exact one right now. Iirc, it was under free pool
manipulation?
If you pop off a LIFO stack onto another LIFO, everything becomes FIFO.

Also, there's a doubly linked stack where the back links are lazily
initialized. That seems to get reinvented every so often.
Chris M. Thomasson
2024-12-26 04:36:12 UTC
Reply
Permalink
Post by jseigh
Post by Chris M. Thomasson
Post by jseigh
Post by Chris M. Thomasson
Hummm... If I remember correctly, you said something about using a
simple atomic exchange to pop a whole list (lock-free stack), then
simple reversing the list to get a fifo order? Do you remember any
of that way back on c.p.t?
That kind of stuff pre-dates c.p.t. even.
Has to. Although, it was not in the IBM principles of operation where
the describe their lock-free stack in an appendix, iirc... I cannot
remember the exact one right now. Iirc, it was under free pool
manipulation?
If you pop off a LIFO stack onto another LIFO, everything becomes FIFO.
Humm.... Merry Christmas!
Post by jseigh
Also, there's a doubly linked stack where the back links are lazily
initialized.  That seems to get reinvented every so often.
ahhhhh!

Chris M. Thomasson
2024-10-28 04:08:23 UTC
Reply
Permalink
Post by Chris M. Thomasson
Post by Chris M. Thomasson
Post by jseigh
Post by Chris M. Thomasson
Post by jseigh
Post by Chris M. Thomasson
I replaced the hazard pointer logic in smrproxy.  It's now
wait- free
instead of mostly wait-free.  The reader lock logic after loading
the address of the reader lock object into a register is now 2
instructions a load followed by a store.  The unlock is same
as before, just a store.
It's way faster now.
It's on the feature/003 branch as a POC.   I'm working on porting
it to c++ and don't want to waste any more time on c version.
No idea of it's a new algorithm.  I suspect that since I use
the term epoch that it will be claimed that it's ebr, epoch
based reclamation, and that all ebr algorithms are equivalent.
Though I suppose you could argue it's qsbr if I point out what
the quiescent states are.
I have to take a look at it! Been really busy lately. Shit happens.
There's a quick and dirty explanation at
http://threadnought.wordpress.com/
repo at https://github.com/jseigh/smrproxy
I'll need to create some memory access diagrams that
visualize how it works at some point.
Anyway if it's new, another algorithm to use without
attribution.
Interesting. From a quick view, it kind of reminds me of a
distributed seqlock for some reason. Are you using an asymmetric
membar in here? in smr_poll ?
Yes, linux membarrier() in smr_poll.
Not seqlock, not least for the reason that exiting the critical region
is 3 instructions unless you use atomics which are expensive and have
memory barriers usually.
A lot of the qsbr and ebr reader lock/unlock code is going to look
somewhat similar so you have to know how the reclaim logic uses it.
In this case I am slingshotting off of the asymmetric memory barrier.
Earlier at one point I was going to have smrproxy use hazard pointer
logic or qsbr logic as a config option, but the extra code complexity
and the fact that qsbr required 2 grace periods kind of made that
unfeasible.  The qsbr logic was mostly ripped out but there were still
some pieces there.
Anyway I'm working a c++ version which involves a lot of extra work
besides just rewriting smrproxy.  There coming up with an api for
proxies and testcases which tend to be more work than the code that
they are testing.
Damn! I almost missed this post! Fucking Thunderbird... Will get
back to you. Working on something else right now Joe, thanks.
https://www.facebook.com/share/p/ydGSuPLDxjkY9TAQ/
No problem.  The c++ work is progressing pretty slowly, not least in
part because the documentation is not always clear as to what
something does or even what problem it is supposed to solve.
To think I took a pass on Rust because I thought it was
more complicated than it needed to be.
Never even tried Rust, shit, I am behind the times. ;^)
Humm... I don't think we can get 100% C++ because of the damn
asymmetric membar for these rather "specialized" algorithms?
Is C++ thinking about creating a standard way to gain an asymmetric membar?
I don't think so.  It's platform dependent.  Apart from linux, mostly
it's done with a call to some virtual memory function that flushes
the TLBs (translation lookaside buffers) which involves IPI calls
to all the processors and those have memory barriers.  This is
old, 1973, patent 3,947,823 cited by the patent I did.
Anyway, I version the code so there's a asymmetric memory barrier
version and an explicit memory barrier version, the latter
being much slower.
I should get back into one of my older proxy algorithms. Things are
mostly wait-free, if XADD can be wait-free itself. No CAS in sight. I
just found an older version I posted... Almost forgot I made this post:

https://groups.google.com/g/comp.lang.c++/c/FBqOMvqWpR4/m/bDZZLUmAAgAJ

https://pastebin.com/raw/nPVYXbWM
(raw text, no ad bullshit)
Chris M. Thomasson
2024-10-28 04:13:05 UTC
Reply
Permalink
Post by Chris M. Thomasson
Post by Chris M. Thomasson
Post by Chris M. Thomasson
Post by jseigh
Post by Chris M. Thomasson
Post by jseigh
Post by Chris M. Thomasson
I replaced the hazard pointer logic in smrproxy.  It's now
wait- free
instead of mostly wait-free.  The reader lock logic after loading
the address of the reader lock object into a register is now 2
instructions a load followed by a store.  The unlock is same
as before, just a store.
It's way faster now.
It's on the feature/003 branch as a POC.   I'm working on porting
it to c++ and don't want to waste any more time on c version.
No idea of it's a new algorithm.  I suspect that since I use
the term epoch that it will be claimed that it's ebr, epoch
based reclamation, and that all ebr algorithms are equivalent.
Though I suppose you could argue it's qsbr if I point out what
the quiescent states are.
I have to take a look at it! Been really busy lately. Shit happens.
There's a quick and dirty explanation at
http://threadnought.wordpress.com/
repo at https://github.com/jseigh/smrproxy
I'll need to create some memory access diagrams that
visualize how it works at some point.
Anyway if it's new, another algorithm to use without
attribution.
Interesting. From a quick view, it kind of reminds me of a
distributed seqlock for some reason. Are you using an asymmetric
membar in here? in smr_poll ?
Yes, linux membarrier() in smr_poll.
Not seqlock, not least for the reason that exiting the critical region
is 3 instructions unless you use atomics which are expensive and have
memory barriers usually.
A lot of the qsbr and ebr reader lock/unlock code is going to look
somewhat similar so you have to know how the reclaim logic uses it.
In this case I am slingshotting off of the asymmetric memory barrier.
Earlier at one point I was going to have smrproxy use hazard pointer
logic or qsbr logic as a config option, but the extra code complexity
and the fact that qsbr required 2 grace periods kind of made that
unfeasible.  The qsbr logic was mostly ripped out but there were still
some pieces there.
Anyway I'm working a c++ version which involves a lot of extra work
besides just rewriting smrproxy.  There coming up with an api for
proxies and testcases which tend to be more work than the code that
they are testing.
Damn! I almost missed this post! Fucking Thunderbird... Will get
back to you. Working on something else right now Joe, thanks.
https://www.facebook.com/share/p/ydGSuPLDxjkY9TAQ/
No problem.  The c++ work is progressing pretty slowly, not least in
part because the documentation is not always clear as to what
something does or even what problem it is supposed to solve.
To think I took a pass on on rust because I though it was
more complicated than it needed to be.
Never even tried Rust, shit, I am behind the times. ;^)
Humm... I don't think we can get 100% C++ because of the damn
asymmetric membar for these rather "specialized" algorithms?
Is C++ thinking about creating a standard way to gain an asymmetric membar?
I don't think so.  It's platform dependent.  Apart from linux, mostly
it's done with a call to some virtual memory function that flushes
the TLBs (translation lookaside buffers) which involves IPI calls
to all the processors and those have memory barriers.  This is
old, 1973, patent 3,947,823 cited by the patent I did.
Anyway, I version the code so there's a asymmetric memory barrier
version and an explicit memory barrier version, the latter
being much slower.
I should get back into one of my older proxy algorithms. Things are
mostly wait free, if XADD can be wait free itself. No CAS in sight. I
https://groups.google.com/g/comp.lang.c++/c/FBqOMvqWpR4/m/bDZZLUmAAgAJ
https://pastebin.com/raw/nPVYXbWM
(raw text, no ad bullshit)
Then we can do something like each reader thread incrementing its own
version counter every, say, 1000 reads, doing a store followed by a
release membar so the polling thread can pick it up via an acquire
load. This is pure C++, no asymmetric magic here at all.
Chris M. Thomasson
2024-10-29 05:02:23 UTC
Reply
Permalink
I replaced the hazard pointer logic in smrproxy.  It's now wait-free
instead of mostly wait-free.  The reader lock logic after loading
the address of the reader lock object into a register is now 2
instructions a load followed by a store.  The unlock is same
as before, just a store.
It's way faster now.
It's on the feature/003 branch as a POC.   I'm working on porting
it to c++ and don't want to waste any more time on c version.
No idea of it's a new algorithm.  I suspect that since I use
the term epoch that it will be claimed that it's ebr, epoch
based reclamation, and that all ebr algorithms are equivalent.
Though I suppose you could argue it's qsbr if I point out what
the quiescent states are.
For some reason you made me think of another very simple proxy technique
using per thread mutexes. It was an experiment a while back:
___________________
per_thread
{
std::mutex m_locks[2];

lock()
{
word ver = g_version;
m_locks[ver % 2].lock();
}

unlock(word ver)
{
m_locks[ver % 2].unlock();
}
}
___________________

The polling thread would increase the g_version counter then lock and
unlock all of the threads previous locks. Iirc, it worked way better
than a read write lock for sure. Basically:
___________________
word ver = g_version.inc(); // ver is the previous version

for all threads as t
{
t.m_locks[ver % 2].lock();
t.m_locks[ver % 2].unlock();
}
___________________

After that, it knew the previous generation was completed.

It was just a way for using a mutex to get distributed proxy like behavior.
Chris M. Thomasson
2024-10-30 06:40:12 UTC
Reply
Permalink
Post by Chris M. Thomasson
I replaced the hazard pointer logic in smrproxy.  It's now wait-free
instead of mostly wait-free.  The reader lock logic after loading
the address of the reader lock object into a register is now 2
instructions a load followed by a store.  The unlock is same
as before, just a store.
It's way faster now.
It's on the feature/003 branch as a POC.   I'm working on porting
it to c++ and don't want to waste any more time on c version.
No idea of it's a new algorithm.  I suspect that since I use
the term epoch that it will be claimed that it's ebr, epoch
based reclamation, and that all ebr algorithms are equivalent.
Though I suppose you could argue it's qsbr if I point out what
the quiescent states are.
For some reason you made me think of another very simple proxy technique
___________________
per_thread
{
    std::mutex m_locks[2];
    lock()
    {
        word ver = g_version;
        m_locks[ver % 2].lock();
    }
    unlock(word ver)
    {
        m_locks[ver % 2].unlock();
    }
}
___________________
The polling thread would increase the g_version counter then lock and
unlock all of the threads previous locks. Iirc, it worked way better
___________________
word ver = g_version.inc(); // ver is the previous version
for all threads as t
{
   t.m_locks[ver % 2].lock();
   t.m_locks[ver % 2].unlock();
}
___________________
After that, it knew the previous generation was completed.
It was just a way for using a mutex to get distributed proxy like behavior.
There are fun things to do here. A thread can do an unlock-lock cycle
every so often, say every 1000 iterations. The fun part is that this can
beat a read-write lock.
Chris M. Thomasson
2024-10-30 06:42:03 UTC
Reply
Permalink
Post by Chris M. Thomasson
I replaced the hazard pointer logic in smrproxy.  It's now wait-free
instead of mostly wait-free.  The reader lock logic after loading
the address of the reader lock object into a register is now 2
instructions a load followed by a store.  The unlock is same
as before, just a store.
It's way faster now.
It's on the feature/003 branch as a POC.   I'm working on porting
it to c++ and don't want to waste any more time on c version.
No idea of it's a new algorithm.  I suspect that since I use
the term epoch that it will be claimed that it's ebr, epoch
based reclamation, and that all ebr algorithms are equivalent.
Though I suppose you could argue it's qsbr if I point out what
the quiescent states are.
For some reason you made me think of another very simple proxy technique
___________________
per_thread
{
    std::mutex m_locks[2];
    lock()
    {
        word ver = g_version;
        m_locks[ver % 2].lock();
    }
    unlock(word ver)
    {
        m_locks[ver % 2].unlock();
    }
Oops! lock() should return the version then pass it on to unlock. Sorry
for missing that in my pseudo-code. ;^o
Post by Chris M. Thomasson
}
___________________
The polling thread would increase the g_version counter then lock and
unlock all of the threads previous locks. Iirc, it worked way better
___________________
word ver = g_version.inc(); // ver is the previous version
for all threads as t
{
   t.m_locks[ver % 2].lock();
   t.m_locks[ver % 2].unlock();
}
___________________
After that, it knew the previous generation was completed.
It was just a way for using a mutex to get distributed proxy like behavior.
Chris M. Thomasson
2024-10-30 06:53:19 UTC
Reply
Permalink
Post by Chris M. Thomasson
Post by Chris M. Thomasson
I replaced the hazard pointer logic in smrproxy.  It's now wait-free
instead of mostly wait-free.  The reader lock logic after loading
the address of the reader lock object into a register is now 2
instructions a load followed by a store.  The unlock is same
as before, just a store.
It's way faster now.
It's on the feature/003 branch as a POC.   I'm working on porting
it to c++ and don't want to waste any more time on c version.
No idea of it's a new algorithm.  I suspect that since I use
the term epoch that it will be claimed that it's ebr, epoch
based reclamation, and that all ebr algorithms are equivalent.
Though I suppose you could argue it's qsbr if I point out what
the quiescent states are.
For some reason you made me think of another very simple proxy
___________________
per_thread
{
     std::mutex m_locks[2];
     lock()
     {
         word ver = g_version;
         m_locks[ver % 2].lock();
     }
     unlock(word ver)
     {
         m_locks[ver % 2].unlock();
     }
Oops! lock() should return the version then pass it on to unlock. Sorry
for missing that in my pseudo-code. ;^o
Post by Chris M. Thomasson
}
___________________
The polling thread would increase the g_version counter then lock and
unlock all of the threads previous locks. Iirc, it worked way better
___________________
word ver = g_version.inc(); // ver is the previous version
for all threads as t
{
    t.m_locks[ver % 2].lock();
    t.m_locks[ver % 2].unlock();
}
___________________
After that, it knew the previous generation was completed.
It was just a way for using a mutex to get distributed proxy like behavior.
Iirc, I still had to use a defer list where the dtors were called for
the nodes.
Chris M. Thomasson
2024-10-30 06:54:33 UTC
Reply
Permalink
I replaced the hazard pointer logic in smrproxy.  It's now wait-free
instead of mostly wait-free.  The reader lock logic after loading
the address of the reader lock object into a register is now 2
instructions a load followed by a store.  The unlock is same
as before, just a store.
It's way faster now.
It's on the feature/003 branch as a POC.   I'm working on porting
it to c++ and don't want to waste any more time on c version.
No idea of it's a new algorithm.  I suspect that since I use
the term epoch that it will be claimed that it's ebr, epoch
based reclamation, and that all ebr algorithms are equivalent.
Though I suppose you could argue it's qsbr if I point out what
the quiescent states are.
Joe, can you call dtors for nodes after a single epoch?
jseigh
2024-10-30 16:43:09 UTC
Reply
Permalink
Post by Chris M. Thomasson
I replaced the hazard pointer logic in smrproxy.  It's now wait-free
instead of mostly wait-free.  The reader lock logic after loading
the address of the reader lock object into a register is now 2
instructions a load followed by a store.  The unlock is same
as before, just a store.
It's way faster now.
It's on the feature/003 branch as a POC.   I'm working on porting
it to c++ and don't want to waste any more time on c version.
No idea of it's a new algorithm.  I suspect that since I use
the term epoch that it will be claimed that it's ebr, epoch
based reclamation, and that all ebr algorithms are equivalent.
Though I suppose you could argue it's qsbr if I point out what
the quiescent states are.
Joe, can you call dtors for nodes after a single epoch?
Yes since I can check that epochs are being referenced
directly instead of indirectly via quiescent states.
It's the inferring from events vs. conditions thing.
Chris M. Thomasson
2024-11-02 20:50:51 UTC
Reply
Permalink
I replaced the hazard pointer logic in smrproxy.  It's now wait-free
instead of mostly wait-free.  The reader lock logic after loading
the address of the reader lock object into a register is now 2
instructions a load followed by a store.  The unlock is same
as before, just a store.
It's way faster now.
It's on the feature/003 branch as a POC.   I'm working on porting
it to c++ and don't want to waste any more time on c version.
No idea of it's a new algorithm.  I suspect that since I use
the term epoch that it will be claimed that it's ebr, epoch
based reclamation, and that all ebr algorithms are equivalent.
Though I suppose you could argue it's qsbr if I point out what
the quiescent states are.
Joe Seigh
Another crude experiment I was testing out back in the day. Heart beat
for per threads, wrt a version that does not use asymmetric membars to
test against:

per_thread
{
word cur = 0;

void beat()
{
word global = g_version;
if (cur == global) return;
store_seq_cst(&cur, global);
}
}


Well, this scheme requires worker threads to beat every now and then when
they are outside of critical sections. Just a test case against my
asymmetric versions. The _single_ poll thread would increment g_version,
then check for threads whose counters equaled the new version. Once all
threads went through it, a quiescent period had passed.

Of course the asymmetric version beat it, but it was fun to test against
anyway.
jseigh
2024-11-21 20:17:32 UTC
Reply
Permalink
I replaced the hazard pointer logic in smrproxy.  It's now wait-free
instead of mostly wait-free.  The reader lock logic after loading
the address of the reader lock object into a register is now 2
instructions a load followed by a store.  The unlock is same
as before, just a store.
It's way faster now.
It's on the feature/003 branch as a POC.   I'm working on porting
it to c++ and don't want to waste any more time on c version.
No idea of it's a new algorithm.  I suspect that since I use
the term epoch that it will be claimed that it's ebr, epoch
based reclamation, and that all ebr algorithms are equivalent.
Though I suppose you could argue it's qsbr if I point out what
the quiescent states are.
I got a port to c++ working now. There are 5 proxy implementations
1) smrproxy v2
2) arcproxy - reference counted proxy
3) rwlock based proxy
4) mutex based proxy
5) an unsafe proxy with no locking

The testcase is templated so you can use any of the
5 proxy implementations without rewriting for each proxy
type.  You can do apples-to-apples comparisons.  I
realize that's the complete antithesis of current
programming practices, but there you have it.  :)

A bit of clean up and performance tuning now.

Joe Seigh
jseigh
2024-11-23 16:10:46 UTC
Reply
Permalink
Post by jseigh
I replaced the hazard pointer logic in smrproxy.  It's now wait-free
instead of mostly wait-free.  The reader lock logic after loading
the address of the reader lock object into a register is now 2
instructions a load followed by a store.  The unlock is same
as before, just a store.
It's way faster now.
It's on the feature/003 branch as a POC.   I'm working on porting
it to c++ and don't want to waste any more time on c version.
No idea of it's a new algorithm.  I suspect that since I use
the term epoch that it will be claimed that it's ebr, epoch
based reclamation, and that all ebr algorithms are equivalent.
Though I suppose you could argue it's qsbr if I point out what
the quiescent states are.
I got a port to c++ working now. There are 5 proxy implementations
1) smrproxy v2
2) arcproxy - reference counted proxy
3) rwlock based proxy
4) mutex based proxy
5) an unsafe proxy with no locking
The testcase is templated so you can use any of the
5 proxy implementations without rewriting for each proxy
type.  You can do apple to apple comparisons.  I
realize that's the complete antithesis of current
programming practices but there you have it.  :)
A bit of clean up and performance tuning now.
Ok, smrproxy lock/unlock is down to 0.6 nanoseconds now,
about what the C version was.

Joe Seigh
Chris M. Thomasson
2024-11-23 21:31:24 UTC
Reply
Permalink
Post by jseigh
Post by jseigh
I replaced the hazard pointer logic in smrproxy.  It's now wait-free
instead of mostly wait-free.  The reader lock logic after loading
the address of the reader lock object into a register is now 2
instructions a load followed by a store.  The unlock is same
as before, just a store.
It's way faster now.
It's on the feature/003 branch as a POC.   I'm working on porting
it to c++ and don't want to waste any more time on c version.
No idea of it's a new algorithm.  I suspect that since I use
the term epoch that it will be claimed that it's ebr, epoch
based reclamation, and that all ebr algorithms are equivalent.
Though I suppose you could argue it's qsbr if I point out what
the quiescent states are.
I got a port to c++ working now. There are 5 proxy implementations
1) smrproxy v2
2) arcproxy - reference counted proxy
3) rwlock based proxy
4) mutex based proxy
5) an unsafe proxy with no locking
The testcase is templated so you can use any of the
5 proxy implementations without rewriting for each proxy
type.  You can do apple to apple comparisons.  I
realize that's the complete antithesis of current
programming practices but there you have it.  :)
A bit of clean up and performance tuning now.
Ok, smrproxy lock/unlock is down to 0.6 nanoseconds now,
about what the C version was.
Nice! Are you using pthread_getspecific or tss_get in your C version?
jseigh
2024-11-23 22:12:56 UTC
Reply
Permalink
Post by Chris M. Thomasson
Post by jseigh
[...]
Ok, smrproxy lock/unlock is down to 0.6 nanoseconds now,
about what the C version was.
Nice! Are you using pthread_getspecific or tss_get in your C version?
Just thread_local.
Chris M. Thomasson
2024-11-24 01:46:32 UTC
[...]
Post by jseigh
Post by Chris M. Thomasson
Post by jseigh
Ok, smrproxy lock/unlock is down to 0.6 nanoseconds now,
about what the C version was.
Nice! Are you using pthread_getspecific or tss_get in your C version?
Just thread_local.
Nice!
jseigh
2024-11-24 20:14:58 UTC
Post by jseigh
[...]
Ok, smrproxy lock/unlock is down to 0.6 nanoseconds now,
about what the C version was.
I've been using cpu time to measure performance. That's ok
for lock-free/wait-free locking. For normal mutexes and
shared locks, it doesn't measure wait time so those didn't
look as bad as they really were. You can add logic
to measure how long it takes to acquire a lock but that
adds significant overhead.

Joe Seigh
Chris M. Thomasson
2024-11-24 23:19:49 UTC
Post by jseigh
[...]
I've been using cpu time to measure performance. That's ok
for lock-free/wait-free locking.  For normal mutexes and
shared locks, it doesn't measure wait time so those didn't
look as bad as they really were.  You can add logic
to measure how long it takes to acquire a lock but that
adds significant overhead.
I remember back in the day when I was comparing and contrasting various
lock/wait-free algorithms with their 100% lock-based counterparts. Some
of the lock-based tests took so long that I just terminated the damn
program. Iirc, a lock-free test would take around 5 minutes. The
lock-based test would be around 30+ minutes. This was way back on c.p.t.
jseigh
2024-11-25 00:09:05 UTC
Post by Chris M. Thomasson
[...]
I remember back in the day when I was comparing and contrasting various
lock/wait-free algorithms with their 100% lock-based counterparts. Some
of the lock-based tests took so long that I just terminated the damn
program. Iirc, a lock-free test would take around 5 minutes. The
lock-based test would be around 30+ minutes. This was way back on c.p.t.
I set the iteration count as a parameter. Mutex can be particularly
slow with a lot of reader threads. I usually see about 1000 - 10000
times slower than smrproxy. rwlocks aren't as bad, about 200 x
slower.

Mutex, rwlock, and arcproxy use interlocked instructions so you
can get a really wide performance range based on cache geometry
and processor sets you run on.

Joe Seigh
Chris M. Thomasson
2024-11-25 00:37:36 UTC
Post by Chris M. Thomasson
Post by jseigh
Post by jseigh
Post by jseigh
[...]
I set the iteration count as a parameter.  Mutex can be particularly
slow with a lot of reader threads.  I usually see about 1000 - 10000
times slower than smrproxy.   rwlocks aren't as bad, about 200 x
slower.
Mutex, rwlock, and arcproxy use interlocked instructions so you
can get a really wide performance range based on cache geometry
and processor sets you run on.
Big time. My older proxy uses interlocked instructions as well. Except,
it does not use any CAS instructions... :^)

https://pastebin.com/raw/f71480694

Wow, back in 2010! How time goes by. Shit... ;^o
Chris M. Thomasson
2024-11-25 00:48:07 UTC
Post by Chris M. Thomasson
Post by Chris M. Thomasson
Post by jseigh
Post by jseigh
Post by jseigh
[...]
Big time. My older proxy uses interlocked instructions as well. Except,
it does not use any CAS instructions... :^)
https://pastebin.com/raw/f71480694
Wow, back in 2010! How time goes by. Shit... ;^o
Btw, my code for that link is using Relacy Race Detector. :^)
Chris M. Thomasson
2024-11-25 00:47:06 UTC
Post by Chris M. Thomasson
Post by jseigh
Post by jseigh
Post by jseigh
[...]
I set the iteration count as a parameter.  Mutex can be particularly
slow with a lot of reader threads.  I usually see about 1000 - 10000
times slower than smrproxy.   rwlocks aren't as bad, about 200 x
slower.
Mutex, rwlock, and arcproxy use interlocked instructions so you
can get a really wide performance range based on cache geometry
and processor sets you run on.
Actually, I remember way back where a scenario that had a lot of writes
would start to mess with a deferred reclamation wrt a polling thread
type of “scheme”. Too many deferred nodes would start to "pile up".
Basically, the single polling thread was having trouble keeping up with
all of them. The interlocked versions seemed to perform sort of "better"
in a sense during periods that had a lot of frequent “writes”. Of course
clever use of node caches helps heavy write periods. Anyway, some of the
tests just used a mutex for writes, others used lock-free and would
generate high loads of them that would push and pop nodes and defer them
to the poll thread to test how much load it (poll thread) could take.
jseigh
2024-11-25 11:31:32 UTC
...
Post by Chris M. Thomasson
Actually, I remember way back where a scenario that had a lot of writes
would start to mess with a deferred reclamation wrt a polling thread
type of “scheme”. Too many deferred nodes would start to "pile up".
Basically, the single polling thread was having trouble keeping up with
all of them. The interlocked versions seemed to perform sort of "better"
in a sense during periods that had a lot of frequent “writes”. Of course
clever use of node caches helps heavy write periods. Anyway, some of the
tests just used a mutex for writes, others used lock-free and would
generate high loads of them that would push and pop nodes and defer them
to the poll thread to test how much load it (poll thread) could take.
Using deferred reclamation to implement a lock-free queue will slow
things down. Also cache line thrashing will limit how fast queue
operations will go no matter what.

Better to parallelize. You can take a really bad lock-free queue
implementation, parallelize it, and then tell everyone how fast
your queue is.

Single queue at 1 op/sec.
Parallel queue on 1000000 processors, 1000000 op/sec.


Joe Seigh
Chris M. Thomasson
2024-11-25 22:12:37 UTC
Post by jseigh
...
Post by Chris M. Thomasson
[...]
Using deferred reclamation to implement a lock-free queue will slow
things down.  Also cache line thrashing will limit how fast queue
operations will go no matter what.
Better to parallelize.  You can take a really bad lock-free queue
implementation, parallelize it, and then tell everyone how fast
your queue is.
LOL! Yeah. In layers. Per thread queues, per cpu queues and some hashed
global ones. Then we get to play with single consumer/producer, multi
consumer single producer, and all that jazz... ;^)

Actually it was an RCU algorithm wrt deferred reclamation that worked
great with read-mostly loads. Then exposing it to a lot of writes sort of
messed around with it, in a bad way. ;^)

The writes were able to swamp things under heavy artificial load.
Post by jseigh
Single queue at 1 op/sec.
Parallel queue on 1000000 processors, 1000000 op/sec.
jseigh
2024-11-27 15:29:05 UTC
Post by jseigh
Post by jseigh
Ok, smrproxy lock/unlock is down to 0.6 nanoseconds now,
about what the C version was.
I've been using cpu time to measure performance. That's ok
for lock-free/wait-free locking.  For normal mutexes and
shared locks, it doesn't measure wait time so those didn't
look as bad as they really were.  You can add logic
to measure how long it takes to acquire a lock but that
adds significant overhead.
Some timings with 128 reader threads

unsafe        52.983 nsecs (      0.000)     860.576 nsecs (      0.000)
smr           54.714 nsecs (      1.732)     882.356 nsecs (     21.780)
smrlite       53.149 nsecs (      0.166)     870.066 nsecs (      9.490)
arc          739.833 nsecs (    686.850)  11,988.289 nsecs ( 11,127.713)
rwlock     1,078.306 nsecs (  1,025.323)  17,309.882 nsecs ( 16,449.306)
mutex      3,203.034 nsecs (  3,150.052)  51,479.407 nsecs ( 50,618.831)

The first column is cpu time, third column is elapsed time.
unsafe is without any synchronized reader access. The
value in parentheses is the unsafe access time subtracted
out to separate out the synchronization overheads. smrlite is
smr proxy with thread_local overhead. So smrproxy lock/unlock
by itself is about 0.1 - 0.2 nanoseconds.

I'm going to drop working on the whole proxy interface thing. The
application can decide if it wants to hardcode a dependency on a
particular 3rd party library implementation or abstract it out
to a more portable api.

Joe Seigh
jseigh
2024-12-09 18:34:11 UTC
Post by jseigh
Some timings with 128 reader threads
unsafe        52.983 nsecs (      0.000)     860.576 nsecs (      0.000)
smr           54.714 nsecs (      1.732)     882.356 nsecs (     21.780)
smrlite       53.149 nsecs (      0.166)     870.066 nsecs (      9.490)
arc          739.833 nsecs (    686.850)  11,988.289 nsecs ( 11,127.713)
rwlock     1,078.306 nsecs (  1,025.323)  17,309.882 nsecs ( 16,449.306)
mutex      3,203.034 nsecs (  3,150.052)  51,479.407 nsecs ( 50,618.831)
The first column is cpu time, third column is elapsed time.
unsafe is without any synchronized reader access.   The
value in parentheses is the unsafe access time subtracted
out to separate out the synchronization overheads.  smrlite is
smr proxy with thread_local overhead.  So smrproxy lock/unlock
by itself is about 0.1 - 0.2 nanoseconds.
I'm going to drop working on the whole proxy interface thing.  The
application can decide if it wants to hardcode a dependency on a
particular 3rd party library implementation or abstract it out
to a more portable api.
I figured out where the smr vs smrlite overhead is likely coming from.

1) thread_local load about .3 nsecs, 2 for lock/unlock so .6 nsecs.
2) overhead from lazy initialization, about .6 nsecs.

smrlite most of the time doesn't show any measurable overhead,
0 nsecs.

Theoretically, you could do lazy initialization with zero
runtime overhead, but for most c++ apps, 1 millisecond is
considered fast, so I don't think there would be much interest
in it.

Joe Seigh
jseigh
2024-12-11 12:13:30 UTC
Post by jseigh
Post by jseigh
I'm going to drop working on the whole proxy interface thing.  The
application can decide if it wants to hardcode a dependency on a
particular 3rd party library implementation or abstract it out
to a more portable api.
I figured out where the smr vs smrlite overhead is likely coming from.
1) thread_local load about .3 nsecs, 2 for lock/unlock so .6 nsecs.
2) overhead from lazy initialization, about .6 nsecs.
smrlite most of the time doesn't show any measurable overhead,
0 nsecs.
Fixed the whole thread_local issue. Just don't use it. I
went back to the api I was using in C and left it up to
the app to keep track of thread local objects. Fixes the
problem of multiple proxies needing multiple per-thread
objects. This is a language issue, not an api issue.
Post by jseigh
Theoretically, you could do lazy initialization with zero
runtime overhead, but for most c++ apps, 1 millisecond is
considered fast, so I don't think there would be much interest
in it.
And apparently I am right. :)

Joe Seigh
Chris M. Thomasson
2024-12-12 00:22:20 UTC
Post by jseigh
Post by jseigh
I'm going to drop working on the whole proxy interface thing.  The
application can decide if it wants to hardcode a dependency on a
particular 3rd party libarary implementation or abstract it out
to a more portable api.
I figured out where the smr vs smrlite overhead is likely coming from.
1) thread_local load about .3 nsecs, 2 for lock/unlock so .6 nsecs.
2) overhead from lazy initialization, about .6 nsecs.
smrlite most of the time doesn't show any measurable overhead,
0 nsecs.
Fixed the whole thread_local issue.  Just don't use it.  I
went back to the api I was using in C and left it up to
the app to keep track of thread local objects. Fixes the
problem of multiple proxies needing multiple per-thread
objects.  This is a language issue, not an api issue.
Well, that works for sure. Humm. If a user wants to register a
per-thread object with the polling thread, well... Let them do it!
Actually, it's a little more flexible, so to speak... Wrt the api:

void proxy_register(per_thread&)
void proxy_unregister(per_thread&)

lol. ;^)
Post by jseigh
Theoretically, you could do lazy initialization with zero
runtime overhead, but for most c++ apps, 1 millisecond is
considered fast, so I don't think there would be much interest
in it.
And apparently I am right. :)
;^D

Unfortunately, I just can't seem to find the time to work on
this. Been working on shaders and OpenGL a lot lately.