Discussion:
Sieve of Eratosthenes optimized to the max
Bonita Montero
2023-12-10 09:46:18 UTC
I've parallelized my sieve of Eratosthenes across all cores. The
parallel threads do all the punching of the non-prime multiples beyond
sqrt( max ). I found that the algorithm is limited by the memory
bandwidth since the bit range between sqrt( max ) and max is too large
to fit inside the cache. So I partitioned each thread's work into a
number of bit ranges that fit inside the L2 cache. With that I got a
speedup of a factor of 28 when calculating about the first 210 million
primes, i.e. all primes <= 2^32.
On my Zen4 16-core PC under Windows with clang-cl 16.0.5, producing the
primes without any file output takes only 0.28s. On my Linux PC, a Zen2
64-core CPU, producing the same number of primes takes about 0.09s.
The file output is done with my own buffering. With that, writing the
mentioned prime range results in a 2.2GB file, which is written in
about two seconds with a PCIe 4.0 SSD.
The first parameter is the highest prime candidate to be generated; it
can be prefixed with 0x for hex values. The second parameter is the
name of the output file; it can be "" to suppress file output. The
third parameter is the number of punching threads; if it is omitted,
the number of threads defaults to the number of threads reported by
the runtime.

#include <iostream>
#include <cstdlib>
#include <charconv>
#include <cstring>
#include <vector>
#include <algorithm>
#include <cmath>
#include <bit>
#include <fstream>
#include <string_view>
#include <thread>
#include <mutex>
#include <condition_variable>
#include <utility>
#include <semaphore>
#include <atomic>
#include <new>
#include <span>
#if defined(_MSC_VER)
#include <intrin.h>
#elif defined(__GNUC__) || defined(__clang__)
#include <cpuid.h>
#endif

#if defined(_MSC_VER)
#pragma warning(disable: 26495)
#endif

using namespace std;

#if defined(__cpp_lib_hardware_interference_size)
constexpr ptrdiff_t CL_SIZE = hardware_destructive_interference_size;
#else
constexpr ptrdiff_t CL_SIZE = 64;
#endif

size_t getL2Size();

int main( int argc, char **argv )
{
if( argc < 2 )
return EXIT_FAILURE;
auto parse = []( char const *str, bool hex, auto &value )
{
from_chars_result fcr = from_chars( str, str + strlen( str ), value,
!hex ? 10 : 16 );
return fcr.ec == errc() && !*fcr.ptr;
};
char const *sizeStr = argv[1];
bool hex = sizeStr[0] == '0' && (sizeStr[1] == 'x' || sizeStr[1] == 'X');
sizeStr += 2 * hex;
size_t end;
if( !parse( sizeStr, hex, end ) )
return EXIT_FAILURE;
if( (ptrdiff_t)end++ < 0 )
throw bad_alloc();
unsigned hc;
if( argc < 4 || !parse( argv[3], false, hc ) )
hc = jthread::hardware_concurrency();
if( !hc )
return EXIT_FAILURE;
using word_t = size_t;
constexpr size_t
BITS = sizeof(word_t) * 8,
CL_BITS = CL_SIZE * 8;
union alignas(CL_SIZE) ndi_words_cl { word_t words[CL_SIZE /
sizeof(word_t)]; ndi_words_cl() {} };
size_t nCls = (end + CL_BITS - 1) / CL_BITS;
vector<ndi_words_cl> sieveCachelines( nCls );
size_t nWords = (end + BITS - 1) / BITS;
span<word_t> sieve( &sieveCachelines[0].words[0], nWords );
auto fill = sieve.begin();
*fill++ = (word_t)0xAAAAAAAAAAAAAAACu;
if( fill != sieve.end() )
for_each( fill, sieve.end(), []( word_t &w ) { w =
(word_t)0xAAAAAAAAAAAAAAAAu; } );
size_t sqrt = (ptrdiff_t)ceil( ::sqrt( (ptrdiff_t)end ) );
auto punch = [&]( auto, size_t start, size_t end, size_t prime )
{
if( start >= end )
return;
size_t multiple = start;
if( prime >= BITS ) [[likely]]
do
sieve[multiple / BITS] &= rotl( (word_t)-2, multiple % BITS );
while( (multiple += prime) < end );
else
{
auto pSieve = sieve.begin() + multiple / BITS;
do
{
word_t
word = *pSieve,
mask = rotl( (word_t)-2, multiple % BITS ),
oldMask;
do
word &= mask,
oldMask = mask,
mask = rotl( mask, prime % BITS ),
multiple += prime;
while( mask < oldMask );
*pSieve++ = word;
} while( multiple < end );
}
};
for( size_t prime = 3; prime < sqrt; ++prime )
{
for( auto pSieve = sieve.begin() + prime / BITS; ; )
{
if( word_t w = *pSieve >> prime % BITS; w ) [[likely]]
{
prime += countr_zero( w );
break;
}
prime = prime + BITS & -(ptrdiff_t)BITS;
if( prime >= sqrt )
break;
}
if( prime >= sqrt ) [[unlikely]]
break;
punch( false_type(), prime * prime, sqrt, prime );
}
auto scan = [&]( size_t start, size_t end, auto fn )
{
for( size_t prime = start; prime < end; )
{
word_t word = sieve[prime / BITS] >> prime % BITS;
bool hasBits = word;
for( unsigned gap; word; word >>= gap, word >>= 1 )
{
gap = countr_zero( word );
if( (prime += gap) >= end ) [[unlikely]]
return;
fn( prime );
if( ++prime >= end ) [[unlikely]]
return;
}
prime = (prime + BITS - hasBits) & -(ptrdiff_t)BITS;
}
};
size_t bitsPerPartition = (end - sqrt) / hc;
using range_t = pair<size_t, size_t>;
vector<pair<size_t, size_t>> ranges;
ranges.reserve( hc );
for( size_t t = hc, rEnd = end, rStart; t--; rEnd = rStart )
{
rStart = sqrt + t * bitsPerPartition;
size_t rClStart = rStart & -(CL_SIZE * 8);
rStart = rClStart >= sqrt ? rClStart : sqrt;
if( rStart >= rEnd )
continue;
ranges.emplace_back( rStart, rEnd );
}
vector<jthread> threads;
threads.reserve( ranges.size() - 1 );
static auto dbl = []( ptrdiff_t x ) { return (double)x; };
double maxPartitionBits = dbl( getL2Size() ) * 8 / 3;
for( range_t const &range : ranges )
{
auto rangePuncher = [&]( size_t rStart, size_t rEnd )
{
double nBits = dbl( rEnd - rStart );
ptrdiff_t nPartitions = (ptrdiff_t)ceil( nBits / maxPartitionBits );
double bitsPerPartition = nBits / dbl( nPartitions );
for( ptrdiff_t p = nPartitions, pEnd = rEnd, pStart; p--; pEnd = pStart )
{
pStart = rStart + (ptrdiff_t)((double)p * bitsPerPartition);
pStart &= -(CL_SIZE * 8);
scan( 3, sqrt,
[&]( size_t prime )
{
punch( true_type(), (pStart + prime - 1) / prime * prime, pEnd,
prime );
} );
}
};
if( &range != &ranges.back() )
threads.emplace_back( rangePuncher, range.first, range.second );
else
rangePuncher( range.first, range.second );
}
threads.resize( 0 );
if( argc >= 3 && !*argv[2] )
return EXIT_SUCCESS;
ofstream ofs;
ofs.exceptions( ofstream::failbit | ofstream::badbit );
ofs.open( argc >= 3 ? argv[2] : "primes.txt", ofstream::binary |
ofstream::trunc );
constexpr size_t
BUF_SIZE = 0x100000,
AHEAD = 32;
union ndi_char { char c; ndi_char() {} };
vector<ndi_char> printBuf( BUF_SIZE + AHEAD - 1 );
auto
bufBegin = printBuf.begin(),
bufEnd = bufBegin,
bufFlush = bufBegin + BUF_SIZE;
auto print = [&]() { ofs << string_view( &bufBegin->c, &to_address(
bufEnd )->c ); };
scan( 2, end,
[&]( size_t prime )
{
auto [ptr, ec] = to_chars( &bufEnd->c, &bufEnd[AHEAD - 1].c, prime );
if( ec != errc() ) [[unlikely]]
throw system_error( (int)ec, generic_category(), "conversion failed" );
bufEnd = ptr - &bufBegin->c + bufBegin; // pointer to iterator - NOP
bufEnd++->c = '\n';
if( bufEnd >= bufFlush ) [[unlikely]]
print(), bufEnd = bufBegin;
} );
print();
}

size_t getL2Size()
{
int regs[4];
auto cpuId = [&]( unsigned code )
{
#if defined(_MSC_VER)
__cpuid( regs, code );
#elif defined(__GNUC__) || defined(__clang__)
__cpuid(code, regs[0], regs[1], regs[2], regs[3]);
#endif
return regs[0];
};
if( (unsigned)cpuId( 0x80000000u ) < 0x80000006u )
return 512ull * 1024;
cpuId( 0x80000006u );
return ((unsigned)regs[2] >> 16) * 1024;
}
Vir Campestris
2023-12-10 21:48:16 UTC
Post by Bonita Montero
I've parallelized my sieve of Eratosthenes across all cores. The
parallel threads do all the punching of the non-prime multiples beyond
sqrt( max ). I found that the algorithm is limited by the memory
bandwidth since the bit range between sqrt( max ) and max is too large
to fit inside the cache. So I partitioned each thread's work into a
number of bit ranges that fit inside the L2 cache. With that I got a
speedup of a factor of 28 when calculating about the first 210 million
primes, i.e. all primes <= 2^32.
On my Zen4 16-core PC under Windows with clang-cl 16.0.5, producing the
primes without any file output takes only 0.28s. On my Linux PC, a Zen2
64-core CPU, producing the same number of primes takes about 0.09s.
The file output is done with my own buffering. With that, writing the
mentioned prime range results in a 2.2GB file, which is written in
about two seconds with a PCIe 4.0 SSD.
The first parameter is the highest prime candidate to be generated; it
can be prefixed with 0x for hex values. The second parameter is the
name of the output file; it can be "" to suppress file output. The
third parameter is the number of punching threads; if it is omitted,
the number of threads defaults to the number of threads reported by
the runtime.
<snip code>

Just so happens I've been thinking about this. I find your code
impenetrable - there are no comments at all. But two things occurred to
me quite quickly:
- You don't need to store the odd numbers
- You only need a bitmask to save the sieve.

I think you're doing the latter, though it's not obvious. I think you're
storing bits in an array of size_t, and that will be what
0xAAAAAAAAAAAAAAACu and 0xAAAAAAAAAAAAAAAAu are about. However if I am
reading that right you've assumed that unsigned is the same size as
size_t, and that the size is 64 bits.

Andy
Bonita Montero
2023-12-11 03:15:47 UTC
Post by Vir Campestris
- You don't need to store the odd numbers
All primes except 2 are odd.
Post by Vir Campestris
- You only need a bitmask to save the sieve.
I'm using the "scan" lambda, which searches through the sieve as fast
as possible with countr_zero, which maps to a single CPU instruction
on modern processors.
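The idea, in isolation, looks roughly like this (a simplified sketch,
not the exact lambda from my code):

#include <bit>
#include <cstdint>
#include <cstddef>

// simplified sketch: call fn( position ) for every set bit in word, using
// countr_zero (tzcnt/bsf on x86) to jump directly to the next set bit.
template<typename Fn>
void forEachSetBit( std::uint64_t word, std::size_t base, Fn fn )
{
    while( word )
    {
        unsigned gap = std::countr_zero( word );
        base += gap;
        fn( base );
        ++base;
        word >>= gap;   // shift the found bit to position 0 ...
        word >>= 1;     // ... then drop it (two shifts avoid UB when gap == 63)
    }
}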
Post by Vir Campestris
... reading that right you've assumed that unsigned is the same
size as size_t, and that the size is 64 bits.
I've got a type word_t that is a size_t, but it can also
be a uint8_t to test what's the performance difference.
Vir Campestris
2023-12-11 17:12:00 UTC
Post by Bonita Montero
Post by Vir Campestris
- You don't need to store the odd numbers
All primes except 2 are odd.
I had a really bad night, and I was half asleep yesterday.

Because all primes except 2 are odd you don't need to store the _even_
numbers, which is what I meant to say. Or else you only need to store
the odd ones.
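I.e. a mapping like this (my own sketch, not taken from your code):

#include <cstddef>

// sketch: one bit per odd candidate, so a sieve up to max needs only max/2 bits.
// odd candidate n lives at bit n/2; bit i stands for the number 2*i + 1.
inline std::size_t bitOf( std::size_t n )    { return n / 2; }      // 3->1, 5->2, 7->3
inline std::size_t numberOf( std::size_t i ) { return 2 * i + 1; }  // 1->3, 2->5, 3->7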
Post by Bonita Montero
Post by Vir Campestris
- You only need a bitmask to save the sieve.
I'm using the "scan" lambda, which searches through the sieve as fast
as possible with countr_zero, which maps to a single CPU instruction
on modern processors.
Post by Vir Campestris
... reading that right you've assumed that unsigned is the same
size as size_t, and that the size is 64 bits.
I've got a type word_t that is a size_t, but it can also
be a uint8_t to test what's the performance difference.
Your constants are 64 bit hexadecimal numbers (from counting the digits)
and are unsigned (the u suffix) but there's no guarantee that size_t
will be the same size as unsigned, nor that it will be 64 bit.
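If you want that bit pattern without assuming a particular width,
something like this works for any unsigned word type (just a sketch):

#include <cstdint>

// sketch: build the alternating 0b...10101010 pattern (only odd positions set,
// i.e. only odd numbers stay prime candidates) for an arbitrary unsigned word
// type instead of hard-coding a 64-bit constant.
template<typename W>
constexpr W oddCandidateMask()
{
    W mask = 0;
    for( unsigned bit = 1; bit < sizeof(W) * 8; bit += 2 )
        mask |= W(1) << bit;
    return mask;
}

static_assert( oddCandidateMask<std::uint8_t>() == 0xAAu );
static_assert( oddCandidateMask<std::uint64_t>() == 0xAAAAAAAAAAAAAAAAu );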

Andy
Bonita Montero
2023-12-11 17:19:17 UTC
Post by Vir Campestris
Because all primes except 2 are odd you don't need to store the _even_
numbers, which is what I meant to say. Or else you only need to store
the odd ones.
I'll try that in the next days, but I don't expect a significant change.
Post by Vir Campestris
Your constants are 64 bit hexadecimal numbers (from counting the digits)
and are unsigned (the u suffix) but there's no guarantee that size_t
will be the same size as unsigned, nor that it will be 64 bit.
The constants are wide enough for 64-bit size_ts, but it won't make
a functional difference if you define word_t as a uint8_t. With a
uint8_t word_t the code runs about 1/6th slower; that's less than
I expected.
Vir Campestris
2023-12-13 15:16:17 UTC
Post by Bonita Montero
Post by Vir Campestris
Because all primes except 2 are odd you don't need to store the _even_
numbers, which is what I meant to say. Or else you only need to store
the odd ones.
I'll try that the next days; but I don't expect a significant change.
Well, I went away and tried it. I didn't write anything nearly as
sophisticated as yours - I didn't worry about threads and caching. The
code follows. You'll have to unwrap it in a few places.

It occurred to me I could go further - not just optimise it to store
odd numbers only, but also miss out the ones divisible by 3, or other
higher primes. The results were:
- Storing all numbers: 9.494 seconds to run the sieve
- Storing odd numbers only: 4.455. That's over twice as fast.
- Excluding also the ones divisible by 3: 3.468. That's an improvement,
but not as much as I expected. Perhaps it's got too complicated. I don't
have a high-powered PC.
Post by Bonita Montero
Post by Vir Campestris
Your constants are 64 bit hexadecimal numbers (from counting the
digits) and are unsigned (the u suffix) but there's no guarantee that
size_t will be the same size as unsigned, nor that it will be 64 bit.
The constants are wide enough for 64 bit size_t-s but it won't make
a functional difference if you define word_t as a uint8_t. With an
uint8_t word_t the code runs about 1/6-th slower; that's less than
I expected.
Most likely what you are seeing there is that you've gone from being
memory bound to being compute bound. The CPU cache will make sure the
main memory accesses are 128 bits (or whatever your bus width is)

Andy
--
#include <cassert>
#include <chrono>
#include <cmath>
#include <iostream>
#include <thread>
#include <vector>

size_t PrimeCount = 1e8;

// this is a convenience class for timing.
class stopwatch {
std::chrono::time_point<std::chrono::steady_clock> mBegin, mEnd;
public:
stopwatch() {}

// record the start of an interval
void start()
{
mBegin = std::chrono::steady_clock::now();
}

// record the end of an interval
void stop()
{
mEnd = std::chrono::steady_clock::now();
}

// report the difference between the last start and stop
double delta() const
{
return
std::chrono::duration_cast<std::chrono::milliseconds>(mEnd -
mBegin).count() / 1000.0;
}
};

// the bitmask will be stores in a class that looks like this
class Primes
{
public:
Primes(size_t size) {};
virtual ~Primes() {};
virtual void unprime(size_t value) = 0;    // marks a number as not prime
virtual bool get(size_t value) const = 0;  // gets how a number is marked
virtual size_t size() const = 0;           // Size of the store
};

// Basic store. Stores all the primes in a vector<bool>
class PrimesVB: public Primes
{
std::vector<bool> mStore;
public:
PrimesVB(size_t size) : Primes(size), mStore(size, true) {};
virtual void unprime(size_t value) override
{
mStore[value] = false;
}

virtual bool get(size_t value) const override
{
return mStore[value];
}

virtual size_t size() const override
{
return mStore.size();
}
};

class Primes2: public PrimesVB
{
size_t mSize;
public:
Primes2(size_t size) : PrimesVB((size + 1) / 2), mSize(size) {};
virtual void unprime(size_t value) override
{
if (value & 1)
PrimesVB::unprime(value / 2);
}

virtual bool get(size_t value) const override
{
if (value <= 2) return true;
return (value & 1) && PrimesVB::get(value / 2);
}

virtual size_t size() const override
{
return mSize;
}
};

class Primes23: public PrimesVB
{
// Size of the map. This is a lot bigger than the size of the bitmap.
size_t mSize;

// each "page" contains 2 "lines". A line is a prime we are storing.
// this map is where we store each number, depending on its value modulo 6.
// -1 is known not to be prime, and isn't stored.
// we store 7, 13, 19 in "line" 0; 11, 17, 23 in "line" 1.
// the others are either divisible by 2 or 3, and don't need to be stored.
static constexpr int mPosMap[6] = {-1, 0, -1, -1, -1, 1};
public:
Primes23(size_t size) : PrimesVB((size + 2) / 3), mSize(size) {};
virtual void unprime(size_t value) override
{
if (value <= 3) return;
auto page = value / 6;
auto line = mPosMap[value % 6];
if (line >= 0)
{
PrimesVB::unprime(page * 2 + line);
}
}

virtual bool get(size_t value) const override
{
if (value <= 3) return true;

auto page = value / 6; // 5=0 7=1 11=1 13=2
auto line = mPosMap[value % 6]; // 5=2 7=0 11=2 13=0
if (line >= 0)
{
return PrimesVB::get(page * 2 + line);
}
else
{
return false;
}
}

virtual size_t size() const override
{
return mSize;
}
};


// Simple sieve of Eratosthenes function
void sieve(Primes& store)
{
const size_t storesize = store.size();
const size_t maxfilter = std::ceil(std::sqrt(storesize));

size_t prime = 2;
while (prime < maxfilter)
{
// Unmark all the multiples
for (size_t inval = prime * 2; inval < storesize; inval += prime)
store.unprime(inval);

// Find the next prime to filter with
while (!store.get(++prime) && prime < maxfilter) {};
}
}

// Benchmark function. Returns the constructed collection for validation.
template <typename storetype> storetype test(const char* classname)
{
stopwatch s;
s.start();
storetype store(PrimeCount);
s.stop();
std::cout << classname << " construct " << s.delta() << " seconds." << std::endl;

s.start();
sieve(store);
s.stop();
std::cout << classname << " sieve " << s.delta() << " seconds."
<< std::endl;

return store;
}

int main()
{
auto vb = test<PrimesVB>("vector<bool>");
auto p2 = test<Primes2>("Storing odds only");
auto p23 = test<Primes23>("Not storing 2s and 3s");

// double check they all match.
for (auto p = 1; p < PrimeCount; ++p)
{
assert(vb.get(p) == p2.get(p));
assert(vb.get(p) == p23.get(p));
}

return 0;
}
Vir Campestris
2023-12-13 15:25:42 UTC
Post by Vir Campestris
Well, I went away and tried it. I didn't write anything nearly as
sophisticated as yours - I didn't worry about threads and caching. The
code follows. You'll have to unwrap it in a few places.
It occurred to me I could go further - not just optimise it out to store
odd numbers only, but also miss out the ones divisible by 3, or other
- Storing all numbers: 9.494 seconds to run the sieve
- Storing odd numbers only: 4.455. That's over twice as fast.
- Excluding also the ones divisible by 3: 3.468. That's an improvement,
but not as much as I expected. Perhaps it's got too complicated. I don't
have a high powered PC.
And no sooner had I posted that than I realised I'd got the optimisation
settings wrong.

With -Ofast fed to g++ the version that doesn't store multiples of 3 is
_slower_ than the odds only one!

Andy
Vir Campestris
2023-12-14 15:06:14 UTC
I tried a few more collections.

The slowest on my system is vector<bool> 12.165s for 1e9 primes.

Having a manual bitmask, and storing in uint64_t is a lot faster -
almost twice as fast (6.659).

A uint32_t is a little faster than that (6.306).

Optimising it to store odds only more than doubles the speed of
vector<bool> to 5.107 seconds.

That has less effect with uint64_t, taking it to 4.946 - which is the
best time I have.

Oddly storing odds only with uint32_t is not as fast as odds only with
uint64_t, although it is still faster than storing all the values (5.302).

I've optimised it to do less recursion, which has helped, but it still
isn't worth not storing the multiples of 3. I'll guess that's because
that code required a divide and mod by 6, whereas optimising out the
evens only requires shifts and masks.
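To illustrate the difference (just a sketch, not the code above):

#include <cstddef>

// odds only: the index is a single shift.
inline std::size_t oddIndex( std::size_t n ) { return n >> 1; }

// mod-6 wheel: a divide, a remainder and a table lookup per access; the caller
// must also check first that posMap[n % 6] is >= 0, i.e. that n is stored at all.
inline std::size_t wheel6Index( std::size_t n )
{
    static constexpr int posMap[6] = { -1, 0, -1, -1, -1, 1 };  // same table as mPosMap above
    return n / 6 * 2 + posMap[n % 6];
}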

Andy
red floyd
2023-12-14 16:20:19 UTC
Post by Vir Campestris
Post by Bonita Montero
Post by Vir Campestris
Because all primes except 2 are odd you don't need to store the
_even_ numbers, which is what I meant to say. Or else you only need
to store the odd ones.
I'll try that the next days; but I don't expect a significant change.
Well, I went away and tried it. I didn't write anything nearly as
sophisticated as yours - I didn't worry about threads and caching. The
code follows. You'll have to unwrap it in a few places.
It occurred to me I could go further - not just optimise it out to store
odd numbers only, but also miss out the ones divisible by 3, or other
- Storing all numbers: 9.494 seconds to run the sieve
- Storing odd numbers only: 4.455. That's over twice as fast.
- Excluding also the ones divisible by 3: 3.468. That's an improvement,
but not as much as I expected. Perhaps it's got too complicated. I don't
have a high powered PC.
Another way to attempt to optimize is that except for 2 and 3, all
primes are of the form 6n+1 or 6n-1.
Tim Rentsch
2023-12-23 18:30:21 UTC
Post by red floyd
Post by Vir Campestris
Because all primes except 2 are odd you don't need to store the
_even_ numbers, which is what I meant to say. Or else you only
need to store the odd ones.
I'll try that the next days; but I don't expect a significant change.
Well, I went away and tried it. I didn't write anything nearly as
sophisticated as yours - I didn't worry about threads and
caching. The code follows. You'll have to unwrap it in a few places.
It occurred to me I could go further - not just optimise it out to
store odd numbers only, but also miss out the ones divisible by 3,
- Storing all numbers: 9.494 seconds to run the sieve
- Storing odd numbers only: 4.455. That's over twice as fast.
- Excluding also the ones divisible by 3: 3.468. That's an
improvement, but not as much as I expected. Perhaps it's got too
complicated. I don't have a high powered PC.
Another way to attempt to optimize is that except for 2 and 3, all
primes are of the form 6n+1 or 6n-1.
A more compact form is to consider numbers mod 30, in which case
numbers that are 0 mod {2, 3, or 5} can be ignored. Conveniently
there are just 8 such values - 1, 7, 11, 13, 17, 19, 23, and 29 -
so each block of 30 can be represented in one 8-bit byte. Doing
that gives a 25% reduction in space compared to a 6n+/-1 scheme.
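A sketch of that representation (my own naming, not a complete sieve):

#include <cstdint>
#include <cstddef>

// one byte per block of 30 numbers; bit k of byte n/30 stands for the number
// 30*(n/30) + wheel[k].  Numbers divisible by 2, 3 or 5 get no bit at all.
constexpr int wheel[8] = { 1, 7, 11, 13, 17, 19, 23, 29 };

// bit position of a residue mod 30, or -1 if the number is divisible
// by 2, 3 or 5 and therefore not stored
constexpr int bitOfResidue( int r )
{
    for( int k = 0; k < 8; ++k )
        if( wheel[k] == r )
            return k;
    return -1;
}

inline bool isCandidate( std::size_t n, const std::uint8_t *bytes )
{
    int bit = bitOfResidue( int( n % 30 ) );
    return bit >= 0 && ( bytes[n / 30] >> bit & 1 );
}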
Vir Campestris
2023-12-23 21:20:42 UTC
Post by Tim Rentsch
A more compact form is to consider numbers mod 30, in which case
numbers that are 0 mod {2, 3, or 5} can be ignored. Conveniently
there are just 8 such values - 1, 7, 11, 13, 17, 19, 23, and 29 -
so each block of 30 can be represented in one 8-bit byte. Doing
that gives a 25% reduction in space compared to a 6n+/-1 scheme.
I found that on my system the modulo operation was so slow this wasn't
worth doing.

Andy
Tim Rentsch
2023-12-24 08:36:42 UTC
Post by Vir Campestris
Post by Tim Rentsch
A more compact form is to consider numbers mod 30, in which case
numbers that are 0 mod {2, 3, or 5} can be ignored. Conveniently
there are just 8 such values - 1, 7, 11, 13, 17, 19, 23, and 29 -
so each block of 30 can be represented in one 8-bit byte. Doing
that gives a 25% reduction in space compared to a 6n+/-1 scheme.
I found that on my system the modulo operation was so slow this wasn't
worth doing.
Depending on how the code is written, no modulo operations need
to be done, because they will be optimized away and done at
compile time. If we look at multiplying two numbers represented
by bits in bytes i and j, the two numbers are

i*30 + a
j*30 + b

for some a and b in { 1, 7, 11, 13, 17, 19, 23, 29 }.

The values we're interested in are the index of the product and
the residue of the product, namely

(i*30+a) * (j*30+b) / 30 (for the index)
(i*30+a) * (j*30+b) % 30 (for the residue)

Any term with a *30 in the numerator doesn't contribute to
the residue, and also can be combined with the by-30 divide
for computing the index. Thus these expressions can be
rewritten as

i*30*j + i*b + j*a + (a*b/30) (for the index)
a*b%30 (for the residue)

When a and b have values that are known at compile time,
neither the divide nor the remainder result in run-time
operations being done; all of that heavy lifting is
optimized away and done at compile time. Of course there
are some multiplies, but they are cheaper than divides, and
also can be done in parallel. (The multiplication a*b also
can be done at compile time.)

The residue needs to be turned into a bit mask to do the
logical operation on the byte of bits, but here again that
computation can be optimized away and done at compile time.
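In code the inner step might look roughly like this (a sketch of the
idea, with the residues a and b as compile-time template parameters):

#include <cstdint>
#include <cstddef>

// bit position of each residue mod 30 (-1 where the number is divisible by 2, 3 or 5)
constexpr int residueBit[30] = {
    -1,  0, -1, -1, -1, -1, -1,  1, -1, -1,
    -1,  2, -1,  3, -1, -1, -1,  4, -1,  5,
    -1, -1, -1,  6, -1, -1, -1, -1, -1,  7 };

// mark the product (i*30 + a) * (j*30 + b) as composite.  With a and b known at
// compile time, a*b/30, a*b%30 and the resulting bit mask all fold into constants;
// only the multiplies and adds remain at run time.
template<int a, int b>
inline void markProduct( std::uint8_t *bytes, std::size_t i, std::size_t j )
{
    std::size_t index = i * 30 * j + i * b + j * a + a * b / 30;   // byte index
    constexpr std::uint8_t mask = std::uint8_t( ~( 1u << residueBit[a * b % 30] ) );
    bytes[index] &= mask;                                          // clear the bit
}

E.g. markProduct<7, 11>( bytes, i, j ) would handle the combination of
the 7-residue and the 11-residue.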

Does that all make sense?
Vir Campestris
2023-12-29 18:03:54 UTC
Post by Tim Rentsch
Depending on how the code is written, no modulo operations need
to be done, because they will be optimized away and done at
compile time. If we look at multiplying two numbers represented
by bits in bytes i and j, the two numbers are
i*30 + a
j*30 + b
for some a and b in { 1, 7, 11, 13 17, 19, 23, 29 }.
The values we're interested in are the index of the product and
the residue of the product, namely
(i*30+a) * (j*30+b) / 30 (for the index)
(i*30+a) * (j*30+b) % 30 (for the residue)
Any term with a *30 in the numerator doesn't contribute to
the residue, and also can be combined with the by-30 divide
for computing the index. Thus these expressions can be
rewritten as
i*30*j + i*b + j*a + (a*b/30) (for the index)
a*b%30 (for the residue)
When a and b have values that are known at compile time,
neither the divide nor the remainder result in run-time
operations being done; all of that heavy lifting is
optimized away and done at compile time. Of course there
are some multiplies, but they are cheaper than divides, and
also can be done in parallel. (The multiplication a*b also
can be done at compile time.)
The residue needs to be turned into a bit mask to do the
logical operation on the byte of bits, but here again that
computation can be optimized away and done at compile time.
Does that all make sense?
Right now, no. But that's me. I'll flag it to read again when I've had a
better night's sleep.

Andy
Bonita Montero
2023-12-20 12:44:49 UTC
Post by Vir Campestris
Because all primes except 2 are odd you don't need to store the _even_
numbers, which is what I meant to say. Or else you only need to store
the odd ones.
I changed the code this morning:

#include <cstdlib>
#include <charconv>
#include <cstring>
#include <vector>
#include <algorithm>
#include <cmath>
#include <bit>
#include <fstream>
#include <string_view>
#include <thread>
#include <utility>
#include <new>
#include <span>
#include <array>
#include <cassert>
#if defined(_MSC_VER)
#include <intrin.h>
#elif defined(__GNUC__) || defined(__clang__)
#include <cpuid.h>
#endif

#if defined(_MSC_VER)
#pragma warning(disable: 26495)
#endif

using namespace std;

#if defined(__cpp_lib_hardware_interference_size)
constexpr ptrdiff_t CL_SIZE = hardware_destructive_interference_size;
#else
constexpr ptrdiff_t CL_SIZE = 64;
#endif

size_t getL2Size();
bool smt();

int main( int argc, char **argv )
{
if( argc < 2 )
return EXIT_FAILURE;
auto parse = []( char const *str, auto &value )
{
bool hex = str[0] == '0' && (str[1] == 'x' || str[1] == 'X');
str += 2 * hex;
auto [ptr, ec] = from_chars( str, str + strlen( str ), value, !hex ?
10 : 16 );
return ec == errc() && !*ptr;
};
size_t end;
if( !parse( argv[1], end ) )
return EXIT_FAILURE;
if( end < 2 || (ptrdiff_t)end++ < 0 )
throw bad_alloc();
unsigned
hc = jthread::hardware_concurrency(),
nThreads;
if( argc < 4 || !parse( argv[3], nThreads ) )
nThreads = hc;
if( !nThreads )
return EXIT_FAILURE;
using word_t = size_t;
constexpr size_t
BITS_PER_CL = CL_SIZE * 8,
BITS = sizeof(word_t) * 8,
DBITS = 2 * BITS;
size_t nBits = end / 2 + (end & 1 ^ 1);
union alignas(CL_SIZE) ndi_words_cl { word_t words[CL_SIZE /
sizeof(word_t)]; ndi_words_cl() {} };
vector<ndi_words_cl> sieveCachelines( (nBits + BITS_PER_CL - 1) /
BITS_PER_CL );
span<word_t> sieve( &sieveCachelines[0].words[0], (nBits + BITS - 1) /
BITS );
fill( sieve.begin(), sieve.end(), (word_t)-1 );
size_t sqrt = (ptrdiff_t)ceil( ::sqrt( (ptrdiff_t)end ) );
auto punch = [&]( auto, size_t start, size_t end, size_t prime )
{
assert(start & 1);
size_t bit = start / 2;
end = end / 2 + (end & 1 ^ 1);
if( bit >= end )
return;
if( prime >= BITS ) [[likely]]
do [[likely]]
sieve[bit / BITS] &= rotl( (word_t)-2, bit % BITS );
while( (bit += prime) < end );
else
{
auto pSieve = sieve.begin() + bit / BITS;
do [[likely]]
{
word_t
word = *pSieve,
mask = rotl( (word_t)-2, bit % BITS ),
oldMask;
do
word &= mask,
oldMask = mask,
mask = rotl( mask, prime % BITS ),
bit += prime;
while( mask < oldMask );
*pSieve++ = word;
} while( bit < end );
}
};
for( size_t prime = 3; prime < sqrt; prime += 2 ) [[likely]]
{
auto pSieve = sieve.begin() + prime / DBITS;
do [[likely]]
if( word_t w = *pSieve >> prime / 2 % BITS; w ) [[likely]]
{
prime += 2 * countr_zero( w );
break;
}
while( (prime = (prime + DBITS & -(ptrdiff_t)DBITS) + 1) < sqrt );
if( prime >= sqrt ) [[unlikely]]
break;
punch( false_type(), prime * prime, sqrt, prime );
}
auto scan = [&]( size_t start, size_t end, auto fn )
{
assert(start & 1);
start /= 2;
end = end / 2 + (end & 1 ^ 1);
if( start >= end )
return;
auto pSieve = sieve.begin() + start / BITS;
for( size_t bit = start; ; )
{
word_t word = *pSieve++ >> bit % BITS;
bool hasBits = word;
for( unsigned gap; word; word >>= gap, word >>= 1 ) [[likely]]
{
gap = countr_zero( word );
if( (bit += gap) >= end ) [[unlikely]]
return;
fn( bit * 2 + 1, true_type() );
if( ++bit >= end ) [[unlikely]]
return;
}
if( bit >= end ) [[unlikely]]
break;
bit = (bit + BITS - hasBits) & -(ptrdiff_t)BITS;
}
};
static auto dbl = []( ptrdiff_t x ) { return (double)x; };
double threadRange = dbl( end - sqrt ) / nThreads;
using range_t = pair<size_t, size_t>;
vector<pair<size_t, size_t>> ranges;
ranges.reserve( nThreads );
for( size_t t = nThreads, rEnd = end, rStart; t--; rEnd = rStart )
[[likely]]
{
rStart = sqrt + (ptrdiff_t)((ptrdiff_t)t * threadRange + 0.5);
size_t rClStart = rStart & -(2 * CL_SIZE * 8);
rStart = rClStart >= sqrt ? rClStart : sqrt;
if( rStart < rEnd )
ranges.emplace_back( rStart, rEnd );
}
double maxCacheRange = dbl( getL2Size() * (8 * 2) ) / 3 * (smt() &&
nThreads > hc / 2 ? 1 : 2);
vector<jthread> threads;
threads.reserve( ranges.size() - 1 );
for( range_t const &range : ranges )
{
auto cacheAwarePunch = [&]( size_t rStart, size_t rEnd )
{
double thisThreadRange = dbl( rEnd - rStart );
ptrdiff_t nCachePartitions = (ptrdiff_t)ceil( thisThreadRange /
maxCacheRange );
double cacheRange = thisThreadRange / dbl( nCachePartitions );
for( size_t p = nCachePartitions, cacheEnd = rEnd, cacheStart; p--;
cacheEnd = cacheStart ) [[likely]]
{
cacheStart = rStart + (ptrdiff_t)((double)(ptrdiff_t)p * cacheRange
+ 0.5);
cacheStart &= -(2 * CL_SIZE * 8);
scan( 3, sqrt,
[&]( size_t prime, auto )
{
size_t start = ((cacheStart >= sqrt ? cacheStart : sqrt) + prime -
1) / prime * prime;
start = start & 1 ? start : start + prime;
punch( true_type(), start, cacheEnd, prime );
} );
}
};
if( &range != &ranges.back() )
threads.emplace_back( cacheAwarePunch, range.first, range.second );
else
cacheAwarePunch( range.first, range.second );
}
threads.resize( 0 );
if( argc >= 3 && !*argv[2] )
return EXIT_SUCCESS;
ofstream ofs;
ofs.exceptions( ofstream::failbit | ofstream::badbit );
ofs.open( argc >= 3 ? argv[2] : "primes.txt", ofstream::binary |
ofstream::trunc );
constexpr size_t
BUF_SIZE = 0x100000,
AHEAD = 32;
union ndi_char { char c; ndi_char() {} };
vector<ndi_char> rawPrintBuf( BUF_SIZE + AHEAD - 1 );
span printBuf( &rawPrintBuf.front().c, &rawPrintBuf.back().c + 1 );
auto
bufBegin = printBuf.begin(),
bufEnd = bufBegin,
bufFlush = bufBegin + BUF_SIZE;
auto print = [&]() { ofs << string_view( bufBegin, bufEnd ); };
auto printPrime = [&]( size_t prime, auto )
{
auto [ptr, ec] = to_chars( &*bufEnd, &bufEnd[AHEAD - 1], prime );
if( ec != errc() ) [[unlikely]]
throw system_error( (int)ec, generic_category(), "conversion failed" );
bufEnd = ptr - &*bufBegin + bufBegin; // pointer to iterator - NOP
*bufEnd++ = '\n';
if( bufEnd >= bufFlush ) [[unlikely]]
print(), bufEnd = bufBegin;
};
printPrime( 2, false_type() );
scan( 3, end, printPrime );
print();
}

array<unsigned, 4> cpuId( unsigned code )
{
array<unsigned, 4> regs;
#if defined(_MSC_VER)
__cpuid( (int *)&regs[0], code );
#elif defined(__GNUC__) || defined(__clang__)
__cpuid(code, regs[0], regs[1], regs[2], regs[3]);
#endif
return regs;
}

bool smt()
{
if( cpuId( 0 )[0] < 1 )
return false;
return cpuId( 1 )[3] >> 28 & 1;
}

size_t getL2Size()
{
if( cpuId( 0x80000000u )[0] < 0x80000006u )
return 512ull * 1024;
return (cpuId( 0x80000006u )[2] >> 16) * 1024;
}

Now the code runs 0.15s on the 16 core Zen4 system, that's +87% faster.
On the 64 core Zen2 system the time to produce the first 21 million
primes is about 0.05 seconds.
Vir Campestris
2023-12-21 15:30:12 UTC
Post by Bonita Montero
Now the code runs 0.15s on the 16 core Zen4 system, that's +87% faster.
On the 64 core Zen2 system the time to produce the first 21 million
primes is about 0.05 seconds.
I thought I'd try it.

Can I make a suggestion? Where it says

if( argc < 2 )
return EXIT_FAILURE;

make it instead print a user guide?
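Something along these lines would do; just a sketch, and the wording is
only my guess at your three parameters:

if( argc < 2 )
{
    // hypothetical usage text; assumes <iostream> is included
    std::cerr <<
        "usage: sieve <max> [<outfile>|\"\"] [<threads>]\n"
        "  <max>      highest prime candidate, optionally 0x-prefixed hex\n"
        "  <outfile>  output file name; \"\" suppresses file output\n"
        "  <threads>  number of punching threads (default: hardware concurrency)\n";
    return EXIT_FAILURE;
}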

On my system my code takes about 20 seconds to produce 1e9 primes. It's
single threaded. Yours is taking a little over 6, but about 18 seconds
of CPU time. (I have 4 cores, each with 2 threads. Mine is designed to
be cool and quiet...)

I've obviously got something wrong though - I'm using a _lot_ more
memory than you.

Out of interest - I thought you must be Spanish from your name. And yet
you speak fluent German?

Andy
Bonita Montero
2023-12-21 16:07:45 UTC
Post by Vir Campestris
Can I make a suggestion? Where it says
    if( argc < 2 )
        return EXIT_FAILURE;
make it instead print a user guide?
The code isn't meant as a ready-to-use application, just as a
technology experiment.
Post by Vir Campestris
On my system my code takes about 20 seconds to produce 1e9 primes. It's
single threaded. Yours is taking a little over 6, but about 18 seconds
of CPU time. (I have 4 cores, each with 2 threads. Mine is designed to
be cool and quiet...)
bg --times --high bitmapSieve 0x1000000000 "" // bg is my own tool
real 2.471s
user 1m:0.313s
sys 0.234s
cycles 273.769.270.155

As you can see, my code spends a minute of CPU time in about 2.5 seconds.
Interestingly, the whole code largely depends on the performance of the
L2 cache. If more than one of my punch threads occupies a core there
is almost no performance improvement, since both threads depend on the
same L2 cache.
Post by Vir Campestris
I've obviously got something wrong though - I'm using a _lot_ more
memory than you.
I'm using a bitmap to store the odd values.
Post by Vir Campestris
Out of interest - I thought you must be Spanish from your name.
And yet you speak fluent German?
My parents come from Peru.
Bonita Montero
2023-12-21 16:13:39 UTC
Post by Vir Campestris
On my system my code takes about 20 seconds to produce 1e9 primes. It's
single threaded. Yours is taking a little over 6, but about 18 seconds
of CPU time. (I have 4 cores, each with 2 threads. Mine is designed to
be cool and quiet...)
I forgot to mention that I'm using "" as the second command-line
parameter, which suppresses the file output. Producing a 2.2GB
file from all primes that fit in a 32-bit number costs an extra
two seconds for my system with a PCIe 4.0 SSD from Samsung.
Tim Rentsch
2023-12-23 18:21:04 UTC
Post by Vir Campestris
On my system my code takes about 20 seconds to produce 1e9
primes. [...]
Do you mean 1e9 primes or just the primes less than 1e9? To
do the first 1e9 primes a sieve would need to go up to about
23.9e9 (so half that many bits if only odd numbers were
represented).
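For reference, the usual upper bound p_n < n * (ln n + ln ln n) (valid
for n >= 6) gives a quick way to check the size needed:

#include <cmath>
#include <cstdio>

int main()
{
    double n = 1e9;                                          // the first 1e9 primes
    double bound = n * ( std::log( n ) + std::log( std::log( n ) ) );
    std::printf( "p_n < about %.1fe9\n", bound / 1e9 );      // prints roughly 23.8e9
}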
Vir Campestris
2023-12-23 21:21:21 UTC
Post by Tim Rentsch
Post by Vir Campestris
On my system my code takes about 20 seconds to produce 1e9
primes. [...]
Do you mean 1e9 primes or just the primes less than 1e9? To
do the first 1e9 primes a sieve would need to go up to about
23.9e9 (so half that many bits if only odd numbers were
represented).
Primes up to 1e9.

I have another idea though, watch this space...

Andy
Tim Rentsch
2023-12-24 18:49:51 UTC
Post by Vir Campestris
Post by Tim Rentsch
Post by Vir Campestris
On my system my code takes about 20 seconds to produce 1e9
primes. [...]
Do you mean 1e9 primes or just the primes less than 1e9? To
do the first 1e9 primes a sieve would need to go up to about
23.9e9 (so half that many bits if only odd numbers were
represented).
Primes up to 1e9.
I have another idea though, watch this space...
Does your machine have enough memory to compute all the primes up
to 24e9? If it does I suggest that for your next milestone.
Chris M. Thomasson
2023-12-21 22:23:05 UTC
Post by Bonita Montero
union alignas(CL_SIZE) ndi_words_cl { word_t words[CL_SIZE /
sizeof(word_t)]; ndi_words_cl() {} };
alignas is pretty damn convenient. Iirc, it pads and aligns? Iirc, it
even works with std::vector.
Bonita Montero
2023-12-22 03:28:59 UTC
Post by Chris M. Thomasson
alignas is pretty damn convenient. Iirc, it pads and aligns? Iirc, it
even works with std::vector.
The problem with that is that I can't have an alignas on a word_t since
the rest of the cacheline would be padded. So I join as many word_t's
as possible into a single ndi_words_cl and later have a span over it to
get full iterator debugging.
Chris M. Thomasson
2023-12-22 04:02:07 UTC
Post by Bonita Montero
Post by Chris M. Thomasson
alignas is pretty damn convenient. Iirc, it pads and aligns? Iirc, it
even works with std::vector.
The problem with that is that I can't have an alignas on a word_t since
the rest of the cacheline would be padded. So I join as much word_t's as
possible in a single ndi_word_s_cl and have a span later on it to have
full iterator debugging.
Yup. This is good practice indeed; as long as they are logically
related, it will be beneficial to do this. Say a cache line is 64 bytes
and a word is 8 bytes: 8 full words can be packed in. No cache line
straddling allowed, damn it!

aligned on an L2 cache line boundary (word being your 8-byte word type):

struct alignas(64) l2_cache_line
{
    word m_words[8];
};

Fine. I think alignas auto packs. I need to bust out the most recent
C++ mode on my compilers to revisit it. Actually, I need to do it
anyway. The fun part is that I have old code that is interesting, well
before alignas:

https://pastebin.com/raw/f37a23918

Fun times! :^)
Bonita Montero
2023-12-22 16:55:19 UTC
Post by Chris M. Thomasson
Fine. I think alignas auto packs. I need to to bust out the most recent
C++ mode on my compilers to revisit it. Actually, I need to do it
anyway. The fun part is that I have old code that is interesting, well
False sharing wouldn't hurt my algorithm much since only the beginning
and the end of the thread-local segment overlap with other threads; but
I do it anyway to have maximum performance.
Chris M. Thomasson
2023-12-23 20:52:25 UTC
Post by Bonita Montero
Post by Chris M. Thomasson
Fine. I think alignas auto packs. I need to to bust out the most
recent C++ mode on my compilers to revisit it. Actually, I need to do
it anyway. The fun part is that I have old code that is interesting,
False sharing wouldn't hurt my algorithm much since only the beginning
and the end of the thread-local segment overlaps with other thread; but
I do it anyway to have maximum performance.
That's a good habit to get into. :^)
Bonita Montero
2023-12-24 10:03:37 UTC
Post by Chris M. Thomasson
Post by Bonita Montero
False sharing wouldn't hurt my algorithm much since only the beginning
and the end of the thread-local segment overlaps with other thread; but
I do it anyway to have maximum performance.
That's a good habit to get into. :^)
I experimentally removed the masking of the lower bits of the
partition bounds according to the cacheline size and there was
no measurable performance loss.
Chris M. Thomasson
2023-12-24 21:24:19 UTC
Post by Bonita Montero
Post by Chris M. Thomasson
Post by Bonita Montero
False sharing wouldn't hurt my algorithm much since only the beginning
and the end of the thread-local segment overlaps with other thread; but
I do it anyway to have maximum performance.
That's a good habit to get into. :^)
I experimentally removed the masking of the lower bits of the
partition bounds  according to the cacheline size and there was
no measurable performance-loss.
Still, imvvho, it _is_ a good practice to get into wrt padding and
aligning...
Bonita Montero
2023-12-26 05:00:00 UTC
Post by Chris M. Thomasson
Still, imvvho, it _is_ a good practice to get into wrt padding and
aligning...
With an upper bound of 2^32 I've got 131070 cachelines per thread.
If I have false sharing at the beginning and end of the range that
doesn't hurt much.
Chris M. Thomasson
2023-12-26 05:39:54 UTC
Post by Bonita Montero
Post by Chris M. Thomasson
Still, imvvho, it _is_ a good practice to get into wrt padding and
aligning...
With an upper bound of 2^32 I've got 131070 cachelines per thread.
If I have false sharing at the beginning and end of the range that
doesn't hurt much.
Remember when Intel first started hyperthreading and god damn threads
could false share with each other (on the stacks!) the low and high 64
byte parts of the 128 byte cache lines? A workaround was to artificially
offset the threads' stacks via alloca.
Bonita Montero
2023-12-26 09:27:54 UTC
Post by Chris M. Thomasson
Remember when Intel first started hyperthreading and god damn threads
could false share with each other (on the stacks!) the low and high 64
byte parts of the 128 byte cache lines? A workaround was to artificially
offset the threads stacks via alloca.
If you have one core and two threads there's no false sharing.
It doesn't matter whether the "conflicting" accesses come from
either thread or from a single thread. False sharing is only
possible with two cores or more.
Chris M. Thomasson
2023-12-26 20:24:50 UTC
Post by Bonita Montero
Post by Chris M. Thomasson
Remember when Intel first started hyperthreading and god damn threads
could false share with each other (on the stacks!) the low and high 64
byte parts of the 128 byte cache lines? A workaround was to
artificially offset the threads stacks via alloca.
If you have one core and two threads there's no false sharing.
So, are you familiar with Intel's early hyperthreading problem? There
was false sharing between the hyperthreads. The workaround did improve
performance by quite a bit. IIRC, my older appcore project had this
workaround incorporated into its logic. I wrote that sucker back in the
very early 2000s. Humm... I will try to find the exact line.
Post by Bonita Montero
It doesn't matter if the "conflicting" accesses come from either
thread of from one thread with that. False sharing is only pos-
sible with two cores or more.
Kaz Kylheku
2023-12-26 23:35:11 UTC
Post by Chris M. Thomasson
Post by Bonita Montero
Post by Chris M. Thomasson
Remember when Intel first started hyperthreading and god damn threads
could false share with each other (on the stacks!) the low and high 64
byte parts of the 128 byte cache lines? A workaround was to
artificially offset the threads stacks via alloca.
If you have one core and two threads there's no false sharing.
So, are you familiar with Intel's early hyper threading problem? There
was false sharing between the hyperthreads. The workaround did improve
performance by quite a bit. IIRC, my older appcore project had this
workaround incorporated into it logic. I wrote that sucker back in very
early 2000's. Humm... I will try to find the exact line.
Could you be both right? The performance problem was real, but maybe
it wasn't "false sharing"? The hyper-thread are the same core; they
share the same caches.

Might you be describing a cache collision rather than false sharing?

A single processor can trigger degenerate cache uses (at any level
of a cache hierarchy). For instance, if pages of virtual memory
are accessed in certain patterns, they can trash the same TLB entry.

Was it perhaps the case that these thread stacks were allocated at such
a stride, that their addresses clashed on the same cache line set?

That could be a problem even on one processor, but it's obvious that
hyperthreading could exacerbate it because switches between hyperthreads
happen on a fine granularity. They don't get to run a full
scheduler-driven time quantum. A low-level processor event like a
pipeline stall (or other resource issue) can drive a hyper thread
switch.
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca
NOTE: If you use Google Groups, I don't see you, unless you're whitelisted.
Chris M. Thomasson
2023-12-26 23:37:16 UTC
Post by Kaz Kylheku
Post by Chris M. Thomasson
Post by Bonita Montero
Post by Chris M. Thomasson
Remember when Intel first started hyperthreading and god damn threads
could false share with each other (on the stacks!) the low and high 64
byte parts of the 128 byte cache lines? A workaround was to
artificially offset the threads stacks via alloca.
If you have one core and two threads there's no false sharing.
So, are you familiar with Intel's early hyper threading problem? There
was false sharing between the hyperthreads. The workaround did improve
performance by quite a bit. IIRC, my older appcore project had this
workaround incorporated into it logic. I wrote that sucker back in very
early 2000's. Humm... I will try to find the exact line.
Could you be both right? The performance problem was real, but maybe
it wasn't "false sharing"? The hyper-thread are the same core; they
share the same caches.
Might you be describing a cache collision rather than false sharing?
Iirc, when I read the docs from Intel, it was the low 64 bytes being
falsely shared with the high 64 bytes of a 128 byte l2 cache line?
Post by Kaz Kylheku
A single processor can trigger degenerate cache uses (at any level
of a cache hierarchy). For instance, if pages of virtual memory
are accessed in certain patterns, they can trash the same TLB entry.
Was it perhaps the case that these thread stacks were allocated at such
a stride, that their addresses clashed on the same cache line set?
That could be a problem even on one processor, but it's obvious that
hyperthreading could exacerbate it because switches between hyperthreads
happen on a fine granularity. They don't get to run a full
scheduler-driven time quantum. A low-level processor event like a
pipeline stall (or other resource issue) can drive a hyper thread
switch.
Chris M. Thomasson
2023-12-27 05:59:51 UTC
Post by Chris M. Thomasson
Post by Kaz Kylheku
Post by Chris M. Thomasson
Post by Bonita Montero
Post by Chris M. Thomasson
Remember when Intel first started hyperthreading and god damn threads
could false share with each other (on the stacks!) the low and high 64
byte parts of the 128 byte cache lines? A workaround was to
artificially offset the threads stacks via alloca.
If you have one core and two threads there's no false sharing.
So, are you familiar with Intel's early hyper threading problem? There
was false sharing between the hyperthreads. The workaround did improve
performance by quite a bit. IIRC, my older appcore project had this
workaround incorporated into it logic. I wrote that sucker back in very
early 2000's. Humm... I will try to find the exact line.
Could you be both right? The performance problem was real, but maybe
it wasn't "false sharing"? The hyper-thread are the same core; they
share the same caches.
Might you be describing a cache collision rather than false sharing?
Iirc, when I read the docs from Intel, it was the low 64 bytes being
falsely shared with the high 64 bytes of a 128 byte l2 cache line?
Something about false interference between threads that should not even
be interacting with one another to begin with. It was a problem. The fix
was based on alloca, so that is something to ponder on.
Post by Chris M. Thomasson
Post by Kaz Kylheku
A single processor can trigger degenerate cache uses (at any level
of a cache hierarchy). For instance, if pages of virtual memory
are accessed in certain patterns, they can trash the same TLB entry.
Was it perhaps the case that these thread stacks were allocated at such
a stride, that their addresses clashed on the same cache line set?
That could be a problem even on one processor, but it's obvious that
hyperthreading could exacerbate it because switches between hyperthreads
happen on a fine granularity. They don't get to run a full
scheduler-driven time quantum. A low-level processor event like a
pipeline stall (or other resource issue) can drive a hyper thread
switch.
Bonita Montero
2023-12-27 09:23:19 UTC
Post by Chris M. Thomasson
Something about false interference between threads that should not even
be interacting with one another to begin with. It was a problem. The fix
was based on alloca, so that is something to ponder on.;
Like with any SMT core there could be cache thrashing between the two
hardware threads. The L1 data cache was only 8kB and maybe only two-way
associative, so it could be thrashing between the threads. But I've no
clue what this would have to do with alloca().
Kaz Kylheku
2023-12-27 20:49:06 UTC
Post by Bonita Montero
Post by Chris M. Thomasson
Something about false interference between threads that should not even
be interacting with one another to begin with. It was a problem. The fix
was based on alloca, so that is something to ponder on.;
Like with any SMT-core there could be cache-thrashing between the
cores. The L1 data cache was only 8kB, maybe only two was associative
that could be thrashing between the cores. But I've no clue what this
would have to to with alloca().
Say you have thread stacks that are offset by some power of two amount,
like a megabyte. Stack addresses at the same depth (like the local
variables of threads executing the same function) are likely to collide
on the same cache set (at different levels of the caching hierarchy: L1,
L2, translation caches).

With alloca, since it moves the stack pointer, we can carve variable
amounts of stack space to randomize the stack offsets (before calling
the work functions). In different threads, we use differently-sized
alloca allocations.
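Roughly like this (a sketch; the offset granularity is made up, and
alloca comes from <alloca.h> on glibc, _alloca from <malloc.h> on
Windows):

#include <alloca.h>
#include <cstddef>

// sketch: give every thread a different stack offset before running the real
// work, so same-depth locals in different threads don't land in the same cache
// set.  workFunction and threadIndex are placeholders for the real thread body.
void threadEntry( unsigned threadIndex, void (*workFunction)() )
{
    std::size_t offset = ( threadIndex % 8 ) * 128;   // made-up granularity: 128 bytes
    volatile char *pad = static_cast<volatile char *>( alloca( offset + 1 ) );
    pad[0] = 0;                                       // touch it so it can't be optimized away
    workFunction();                                   // runs with the shifted stack depth
}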
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca
NOTE: If you use Google Groups, I don't see you, unless you're whitelisted.
Bonita Montero
2023-12-28 11:00:27 UTC
Post by Kaz Kylheku
Say you have thread stacks that are offset by some power of two amount,
like a megabyte. Stack addresses at the same depth (like the local
variables of threads executing the same function) are likely to collide
on the same cache set (at different levels of the caching hierachy: L1,
L2, translation caches).
Post by Kaz Kylheku
With alloca, since it moves the stack pointer, we can carve variable
amounts of stack space to randomize the stack offsets (before calling
the work functions). In different threads, we use differently-sized
alloca allocations.
That's not something alloca() could alleviate. The startup code inside
the userspace part of the thread should randomize the starting address
of the stack within the size of a set.
Most resources on the net say that the Pentium 4's L1 data cache
associativity is eight, so there's not much chance of aliasing inside
the L1D cache.
Chris M. Thomasson
2023-12-28 23:38:01 UTC
Post by Bonita Montero
Post by Kaz Kylheku
Say you have thread stacks that are offset by some power of two amount,
like a megabyte. Stack addresses at the same depth (like the local
variables of threads executing the same function) are likely to collide
on the same cache set (at different levels of the caching hierachy: L1,
L2, translation caches).
Post by Kaz Kylheku
With alloca, since it moves the stack pointer, we can carve variable
amounts of stack space to randomize the stack offsets (before calling
the work functions). In different threads, we use differently-sized
alloca allocations.
That's not sth. alloca() could alleviate. The startup-code inside the
userspace-part of the thread should randomize the starting address of
the stack within the size of a set.
Most resources on the net say that the Pentium 4's L1 data cache asso-
ciativity is eight, so there's not much chance of aliasing inside the
L1D cache.
The use of alloca to try to get around the problem in their (Intel's)
early hyperthreaded processors was real, and actually helped. It was in
the Intel docs.
Bonita Montero
2023-12-29 03:17:09 UTC
Post by Chris M. Thomasson
The use of alloca to try to get around the problem in their (Intel's)
early hyperthreaded processors was real, and actually helped. It was in
the Intel docs.
I don't believe it.
Chris M. Thomasson
2023-12-29 04:58:51 UTC
Post by Bonita Montero
Post by Chris M. Thomasson
The use of alloca to try to get around the problem in their (Intel's)
early hyperthreaded processors was real, and actually helped. It was
in the Intel docs.
I don't believe it.
Why not?
Bonita Montero
2023-12-29 09:58:30 UTC
Post by Chris M. Thomasson
Post by Bonita Montero
Post by Chris M. Thomasson
The use of alloca to try to get around the problem in their (Intel's)
early hyperthreaded processors was real, and actually helped. It was
in the Intel docs.
I don't believe it.
Why not?
Because I don't understand what's different between the access pattern
of alloca() and usual stack allocation. And if I google for "alloca
Pentium 4 site:intel.com" I can't find anything that fits.
David Brown
2023-12-29 12:56:54 UTC
Post by Bonita Montero
Post by Chris M. Thomasson
Post by Bonita Montero
Post by Chris M. Thomasson
The use of alloca to try to get around the problem in their
(Intel's) early hyperthreaded processors was real, and actually
helped. It was in the Intel docs.
I don't believe it.
Why not?
Because I don't understand what's different with the access pattern
of alloca() and usual stack allocation. And if I google for "alloca
Pentium 4 site:intel.com" I can't find anything that fits.
There is nothing different between alloca() and ordinary stack allocations.
But alloca() makes it quick and easy to make large allocations, and to
do so with random sizes (or at least sizes that differ significantly
between threads, even if they are running the same code). Without
alloca(), you'd need to do something like arranging to call a recursive
function a random number of times before it then calls the next bit of
code in your thread. alloca() is simply far easier and faster.
Bonita Montero
2023-12-29 16:04:31 UTC
Post by David Brown
There is nothing different from alloca() and ordinary stack allocations.
 But alloca() makes it quick and easy to make large allocations, and to
do so with random sizes (or at least sizes that differ significantly
between threads, even if they are running the same code).
I've got my own class overflow_array<> which is like an array<> and a
vector<>. If you append more than the internal array<> can handle, the
objects are moved to an internal vector. I think Boost's small_vector
is similar to that.
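In outline it is just the small-buffer idea (a much simplified sketch,
not my actual class):

#include <array>
#include <vector>
#include <cstddef>

// sketch: append into an internal array first and spill everything into a
// vector once the array is full.
template<typename T, std::size_t N>
class overflow_array_sketch
{
    std::array<T, N> inlineBuf;
    std::size_t      used = 0;
    std::vector<T>   big;                // only used after the array overflows
public:
    void push_back( T const &value )
    {
        if( big.empty() && used < N )
            inlineBuf[used++] = value;
        else
        {
            if( big.empty() )            // first overflow: copy the array contents over
                big.assign( inlineBuf.begin(), inlineBuf.begin() + used );
            big.push_back( value );
        }
    }
    std::size_t size() const { return big.empty() ? used : big.size(); }
    T const &operator[]( std::size_t i ) const { return big.empty() ? inlineBuf[i] : big[i]; }
};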
Post by David Brown
Without alloca(), you'd need to do something like arranging to call a
recursive function a random number of times before it then calls the
next bit of code in your thread.  alloca() is simply far easier and
faster.
You've got strange ideas. alloca() has been completely removed from the
Linux kernel. The point is that if there's a fixed upper limit on what
you would allocate, you can always allocate it statically.
David Brown
2023-12-30 18:27:30 UTC
Post by Bonita Montero
Post by David Brown
There is nothing different from alloca() and ordinary stack
allocations.  But alloca() makes it quick and easy to make large
allocations, and to do so with random sizes (or at least sizes that
differ significantly between threads, even if they are running the
same code).
I've got my own class overflow_array<> which is like an array<> and a
vector<>. If you append more than the internal array<> can handle the
objects are moved to an internal vector. I think Boost's small_array
is similar to that.
I'm sure that's all very nice, but it is completely and utterly
irrelevant to the issue being discussed. Perhaps you didn't understand
your own post?
Post by Bonita Montero
Post by David Brown
Without  alloca(), you'd need to do something like arranging to call a
recursive  function a random number of times before it then calls the
next bit of  code in your thread.  alloca() is simply far easier and
faster.
You've got strange ideas. alloca() has been completely removed from the
Linux kernel.
Citation? You are usually wrong in your claims, or at least mixed-up,
so I won't trust you without evidence. (That does not mean you are
wrong here - I don't know either way.)

Of course, avoiding alloca() within the kernel is, again, utterly
irrelevant to the point under discussion.
Post by Bonita Montero
The point is that if there's a fixed upper limit you would
allocate you could allocate it always statically.
No, that would be useless.

The idea is to have /different/ allocation sizes in different threads,
so that the different threads' stacks end up at wildly different
addresses, and hence with different index and tag bits in the processor
caches.

I can't tell you how helpful or not this may be in practice. I am
merely trying to explain to you what the idea is, since you said you did
not understand it.
Bonita Montero
2023-12-31 10:22:23 UTC
Reply
Permalink
Post by David Brown
The idea is to have /different/ allocation sizes in different threads,
so that the different threads have their stacks at wildly different
address bits in their tags for the processor caches.
There's not much need to do that with four-way associativity.
Post by David Brown
I can't tell you how helpful or not this may be in practice. ...
Even Intel can't do that.
Scott Lurndal
2023-12-31 18:49:22 UTC
Reply
Permalink
Post by David Brown
Post by Bonita Montero
You've got strange ideas. alloca() has been completely removed from the
Linux kernel.
Citation? You are usually wrong in your claims, or at least mixed-up,
so I won't trust you without evidence. (That does not mean you are
wrong here - I don't know either way.)
Kernel threads generally run with a small, fixed stack. Stack-based
dynamic allocation is generally avoided. I don't know of any
hard restrictions, but I suspect that Linus would look askance
at it.
Post by David Brown
Of course, avoiding alloca() within the kernel is, again, utterly
irrelevant to the point under discussion.
Very true
David Brown
2024-01-01 11:46:39 UTC
Reply
Permalink
Post by Scott Lurndal
Post by David Brown
Post by Bonita Montero
You've got strange ideas. alloca() has been completely removed from the
Linux kernel.
Citation? You are usually wrong in your claims, or at least mixed-up,
so I won't trust you without evidence. (That does not mean you are
wrong here - I don't know either way.)
Kernel threads generally run with a small, fixed stack. Stack-based
dynamic allocation is generally avoided. I don't know of any
hard restrictions, but I suspect that Linus would look askance
at it.
Oh, I have no doubts that it would be frowned upon in the kernel itself.
And I know that VLAs have been actively removed from the kernel. But
Bonita's claim implies that "alloca()" was used in the kernel earlier,
and has since been removed, presumably due to a specific decision.
Post by Scott Lurndal
Post by David Brown
Of course, avoiding alloca() within the kernel is, again, utterly
irrelevant to the point under discussion.
Very true
Scott Lurndal
2023-12-29 16:01:47 UTC
Reply
Permalink
Post by Bonita Montero
Post by Chris M. Thomasson
Post by Bonita Montero
Post by Chris M. Thomasson
The use of alloca to try to get around the problem in their (Intel's)
early hyperthreaded processors was real, and actually helped. It was
in the Intel docs.
I don't believe it.
Why not?
Because I don't understand what's different with the access pattern
of alloca() and usual stack allocation. And if I google for "alloca
Pentium 4 site:intel.com" I can't find anything that fits.
See page 2-35 in the _Intel Pentium 4 Processor Optimization_
manual.
Bonita Montero
2023-12-29 16:06:44 UTC
Reply
Permalink
Post by Scott Lurndal
See page 2-35 in the _Intel Pentium 4 Processor Optimization_
manual.
Where can I find something referring to alloca() there?
Chris M. Thomasson
2023-12-29 21:45:12 UTC
Reply
Permalink
Post by Bonita Montero
Post by Scott Lurndal
See page 2-35 in the _Intel Pentium 4 Processor Optimization_
manual.
Where can I find sth. referring to alloca() there ?
Huh?
Chris M. Thomasson
2023-12-29 22:09:02 UTC
Reply
Permalink
page 2-35 in the _Intel Pentium 4 Processor Optimization_
manual.
I think it was in chapter 5 of Developing Multithreaded Applications: A
Platform Consistent Approach; I cannot remember the damn section right now.
Chris M. Thomasson
2023-12-29 22:12:26 UTC
Reply
Permalink
Post by Chris M. Thomasson
page 2-35 in the_Intel Pentium 4 Processor Optimization_
manual.
I think it was in chapter 5 of Developing Multithreaded Applications: A
Platform Consistent Approach cannot remember the damn section right now.
Wait a minute! I might have found it, let's see:

https://www.intel.com/content/dam/www/public/us/en/documents/training/developing-multithreaded-applications.pdf

Ahhh section 5.3! Nice! I read this a while back, before 2005.

Thanks for sparking my memory Scott.
Bonita Montero
2023-12-30 04:42:50 UTC
Reply
Permalink
Post by Chris M. Thomasson
https://www.intel.com/content/dam/www/public/us/en/documents/training/developing-multithreaded-applications.pdf
Ahhh section 5.3! Nice! I read this a while back, before 2005.
What it says is nonsense if the cache is really four-way
associative, as in the other paper mentioned here. And of
course that has nothing to do with false sharing.
Chris M. Thomasson
2023-12-30 04:45:35 UTC
Reply
Permalink
Post by Bonita Montero
Post by Chris M. Thomasson
https://www.intel.com/content/dam/www/public/us/en/documents/training/developing-multithreaded-applications.pdf
Ahhh section 5.3! Nice! I read this a while back, before 2005.
This is nonsense what it says if the cache is really four-way
associative like in the other paper mentioned here. And of
course that has nothing to do with false sharing
You should write Intel a letter. ;^o
Kaz Kylheku
2023-12-30 04:56:19 UTC
Reply
Permalink
Post by Bonita Montero
Post by Chris M. Thomasson
https://www.intel.com/content/dam/www/public/us/en/documents/training/developing-multithreaded-applications.pdf
Ahhh section 5.3! Nice! I read this a while back, before 2005.
This is nonsense what it says if the cache is really four-way
associative like in the other paper mentioned here. And of
course that has nothing to do with false sharing
If it's four-way associative, you start to have a performance problem as
soon as five things collide on it. For instance, suppose you have two
thread stack tops mapped to the same cache lines, plus three more data
structures heavily being accessed.

Oh, and the above Intel paper does actually mention alloca:

"The easiest way to adjust the initial stack address for each thread is
to call the memory allocation function, _alloca, with varying byte
amounts in the intermediate thread function."
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca
NOTE: If you use Google Groups, I don't see you, unless you're whitelisted.
Bonita Montero
2023-12-30 05:09:55 UTC
Reply
Permalink
Post by Kaz Kylheku
If it's four way associative, you star to have a performance problem as
soon as five things collide on it. ...
With four-way associativity that's rather unlikely.
Kaz Kylheku
2023-12-30 05:51:51 UTC
Reply
Permalink
Post by Bonita Montero
Post by Kaz Kylheku
If it's four way associative, you star to have a performance problem as
soon as five things collide on it. ...
With four-way associativity that's rather unlikely.
Under four-way associativity, five isn't greater than four?
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca
NOTE: If you use Google Groups, I don't see you, unless you're whitelisted.
Bonita Montero
2023-12-30 09:15:34 UTC
Reply
Permalink
Post by Kaz Kylheku
Under four-way associativity, five isn't greater than four?
Two-way associativity would leave no space if both threads have
synchronous stack frames. With four-way associativity there's
not much likelihood of that happening.
Kaz Kylheku
2023-12-30 20:35:03 UTC
Reply
Permalink
Post by Bonita Montero
Post by Kaz Kylheku
Under four-way associativity, five isn't greater than four?
Two-way associativeness would leave no space if both threads have
synchronous stack frames. With four-way associativeness there's
not much likehood for that to happen.
My comment makes it clear that there are two thread stacks vying
for that cache line, plus a couple of other accesses that
are not thread stacks.

(By the way, set associative caches don't always have full LRU
replacement strategies.)
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca
NOTE: If you use Google Groups, I don't see you, unless you're whitelisted.
Bonita Montero
2023-12-31 05:54:44 UTC
Reply
Permalink
Post by Kaz Kylheku
My comment makes it clear that there are two thread stacks
vying for that cache line, plus a couple of other accesses
that are not thread stacks.
With four way associativity that's rather unlikely to happen.
Kaz Kylheku
2023-12-31 07:01:07 UTC
Reply
Permalink
Post by Bonita Montero
Post by Kaz Kylheku
My comment makes it clear that there are two thread stacks
vying for that cache line, plus a couple of other accesses
that are not thread stacks.
With four way associativity that's rather unlikely to happen.
What? The associativity of the cache is not the determiner of
program behavior; if the program accesses five different
areas whose addresses are the same modulo 65536 bytes,
that happens whether there a direct mapped cache, fully
associative cache or no cache at all, or ...
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca
NOTE: If you use Google Groups, I don't see you, unless you're whitelisted.
Bonita Montero
2023-12-31 10:20:50 UTC
Reply
Permalink
Post by Kaz Kylheku
Post by Bonita Montero
Post by Kaz Kylheku
My comment makes it clear that there are two thread stacks
vying for that cache line, plus a couple of other accesses
that are not thread stacks.
With four way associativity that's rather unlikely to happen.
What? The associativity of the cache is not the determiner
of program behavior; if the program accesses five different
areas whose addresses are the same modulo 65536 bytes, ...
The set size is smaller than 64kB and of course there can be aliasing,
but with four-way associativity it's unlikely that there's a need to
control aliasing. And if the set size were 64kB, aliasing would be
even less likely through page colouring.
Post by Kaz Kylheku
that happens whether there a direct mapped cache, fully
associative cache or no cache at all, or ...
Kaz Kylheku
2023-12-31 17:30:09 UTC
Reply
Permalink
Post by Bonita Montero
Post by Kaz Kylheku
Post by Bonita Montero
Post by Kaz Kylheku
My comment makes it clear that there are two thread stacks
vying for that cache line, plus a couple of other accesses
that are not thread stacks.
With four way associativity that's rather unlikely to happen.
What? The associativity of the cache is not the determiner
of program behavior; if the program accesses five different
areas whose addresses are the same modulo 65536 bytes, ...
The set size is smaller than 64kB an of course there can be aliasing
but with four way associativity it's unlikely that there's a need to
control aliasing. And if the set size would be 64kB aliasing would be
even less likely through page colouring.
When I describe a scenario in which five items are being frequently
accessed and collide to the same cache line, which is associative with a
set size of four, there is necessarily a conflict, because five is
greater than four.
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca
NOTE: If you use Google Groups, I don't see you, unless you're whitelisted.
Bonita Montero
2024-01-01 05:21:34 UTC
Reply
Permalink
Post by Kaz Kylheku
When I describe a scenario in which five items are being frequently
accessed and collide to the same cache line, which is associative with
a set size of four, there is necessarily a conflict, because five is
greater than four.
However, this scenario is unlikely to occur often enough that
it makes sense not to let the stacks run synchronously.
Scott Lurndal
2023-12-31 18:44:54 UTC
Reply
Permalink
Post by Bonita Montero
Post by Kaz Kylheku
If it's four way associative, you star to have a performance problem as
soon as five things collide on it. ...
With four-way associativity that's rather unlikely.
Why? The set selection is based on specific bits in the address. As soon
as a fifth address hits the set, you lose one of the existing lines.

Given the aligned stacks in the specified processor, the next function
called would _always_ overwrite the elements of the set(s) used by the calling function.
Bonita Montero
2024-01-01 05:22:25 UTC
Reply
Permalink
Why? ...
Because there must be set conflicts on the same line index.
That's rather unlikely with four-way associativity.
Kaz Kylheku
2024-01-01 08:28:57 UTC
Reply
Permalink
Post by Bonita Montero
Why? ...
Because there must be set-conflicts on the same line index.
That's rather unlikely with four-way associativeness.
It's rather likely. Suppose you have a large number of threads,
with stacks that end on a multiple of 64 kB.

Only two of these threads are runnable concurrently on the
two hyperthreads. However, your application may be rapidly
switching between them.

If their stacks evacuate each other from the L1 cache,
that could be bad.

For compute-bound tasks with long quanta, it might not
matter so much; the same threads will occupy the hyperthreads
for tens of milliseconds at a time or whatever.
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca
NOTE: If you use Google Groups, I don't see you, unless you're whitelisted.
Bonita Montero
2024-01-01 13:11:02 UTC
Reply
Permalink
It's rather likely. ...
Absolutely not.
Suppose you have a large number of threads,
with stacks that end on a multiple of 64 kB.
The set size isn't 64kB.
Only two of these threads are runnable concurrently on the
two hyperthreads. However, your application may be rapidly
switching between them.
If you have two threads that are synchronous in terms of stack allocation,
you need conflicting data that touches cache lines with the same
set indices. That's unlikely.
The suggestion Intel made here is just a nerd-suggestion.
Chris M. Thomasson
2024-01-01 23:36:31 UTC
Reply
Permalink
Post by Bonita Montero
It's rather likely. ...
Absolutely not.
Suppose you have a large number of threads,
with stacks that end on a multiple of 64 kB.
The set size isn't 64kB.
Only two of these threads are runnable concurrently on the
two hyperthreads. However, your application may be rapidly
switching between them.
If you have two threads synchronous in terms of stack allocation
you need conflicting data that touches cacklines with the same
set indices. That's unlikely.
The suggestion Intel made here is just a Nerd-suggestion.
Are you are a full blown moron, or just a little?
Kaz Kylheku
2023-12-30 04:51:47 UTC
Reply
Permalink
Post by Chris M. Thomasson
Post by Chris M. Thomasson
page 2-35 in the_Intel Pentium 4 Processor Optimization_
manual.
I think it was in chapter 5 of Developing Multithreaded Applications: A
Platform Consistent Approach cannot remember the damn section right now.
https://www.intel.com/content/dam/www/public/us/en/documents/training/developing-multithreaded-applications.pdf
Ahhh section 5.3! Nice! I read this a while back, before 2005.
Wow, I guessed that one. Elsewhere in the thread, I made a remark
similar to "imagine that thread stacks are aligned at an address like
nnnnFFFF", i.e. the top of the stack starts at the top of a 64 kB
aligned window. I was exactly thinking of a typical L1 cache size
from around that era, in fact.
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca
NOTE: If you use Google Groups, I don't see you, unless you're whitelisted.
Chris M. Thomasson
2023-12-30 20:00:00 UTC
Reply
Permalink
Post by Kaz Kylheku
Post by Chris M. Thomasson
Post by Chris M. Thomasson
page 2-35 in the_Intel Pentium 4 Processor Optimization_
manual.
I think it was in chapter 5 of Developing Multithreaded Applications: A
Platform Consistent Approach cannot remember the damn section right now.
https://www.intel.com/content/dam/www/public/us/en/documents/training/developing-multithreaded-applications.pdf
Ahhh section 5.3! Nice! I read this a while back, before 2005.
Wow, I guessed that one. Elsewhere in the thread, I made a remark
similar to "imagine that thread stacks are aligned at an address like
nnnnFFFF" I.e. the top of the stack starts at the top of a 64 kB
aligned window. I was exactly thinking of a typical L1 cache size
from aroudn that era, in fact.
Yup! You pretty much got it. Thanks Kaz. :)
Kaz Kylheku
2023-12-29 17:29:00 UTC
Reply
Permalink
Post by Bonita Montero
Post by Chris M. Thomasson
Post by Bonita Montero
Post by Chris M. Thomasson
The use of alloca to try to get around the problem in their (Intel's)
early hyperthreaded processors was real, and actually helped. It was
in the Intel docs.
I don't believe it.
Why not?
Because I don't understand what's different with the access pattern
of alloca() and usual stack allocation. And if I google for "alloca
Pentium 4 site:intel.com" I can't find anything that fits.
I explained it. The allocation is not used. When you call alloca(n),
the stack pointer moves by n bytes. If you then call a function,
its stack frame will be offset by that much (plus any alignment if
n is not aligned).
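A tiny sketch of that mechanism (my own illustration; STACK_ALLOC is an assumed portability shim, and it should be built without inlining, e.g. -O0, to see the effect): the alloca'd block is never read, yet the next call's frame lands about 4 KB lower.

#include <cstdio>
#if defined(_MSC_VER)
#include <malloc.h>
#define STACK_ALLOC _alloca
#else
#include <alloca.h>
#define STACK_ALLOC alloca
#endif

void callee( char const *label )
{
    int local;
    std::printf( "%s: frame at %p\n", label, static_cast<void *>( &local ) );
}

int main()
{
    callee( "before alloca" );
    volatile char *pad = static_cast<char *>( STACK_ALLOC( 4096 ) ); // shift the stack pointer
    pad[0] = 0;                  // keep the allocation (and the shift) alive
    callee( "after alloca" );    // this frame now sits roughly 4 KB lower
}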
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca
NOTE: If you use Google Groups, I don't see you, unless you're whitelisted.
Bonita Montero
2023-12-30 04:45:50 UTC
Reply
Permalink
Post by Kaz Kylheku
I explained it. The allocation is not used. When you call alloca(n),
the stack pointer moves by n bytes. If you then call a function,
its stack frame will be offset by that much (plus any alignment if
n is not aligned).
According to the paper Scott mentioned the associativity of the
Pentium 4's L1 data cache is four. With that it's not necessary
to have such aliasing preventions.
Chris M. Thomasson
2023-12-30 19:58:45 UTC
Reply
Permalink
Post by Bonita Montero
Post by Kaz Kylheku
I explained it. The allocation is not used. When you call alloca(n),
the stack pointer moves by n bytes. If you then call a function,
its stack frame will be offset by that much (plus any alignment if
n is not aligned).
According to the paper Scott mentioned the associativity of the
Pentium 4's L1 data cache is four. With that it's not necessary
to have such aliasing preventions.
Huh? Wow, you really need to write Intel a letter about it wrt their
older hyperthreaded processors! Although, it seems like you simply do
not actually _understand_ what is going on here...

Intel's suggestions for how to mitigate the problem in their earlier
hyperthreaded processors actually worked wrt improving performance. Keep
in mind this was a while back, in 2004-2005. I am happy that the Wayback
Machine has my older code.
red floyd
2023-12-30 22:58:06 UTC
Reply
Permalink
Post by Chris M. Thomasson
Post by Bonita Montero
Post by Kaz Kylheku
I explained it. The allocation is not used. When you call alloca(n),
the stack pointer moves by n bytes. If you then call a function,
its stack frame will be offset by that much (plus any alignment if
n is not aligned).
According to the paper Scott mentioned the associativity of the
Pentium 4's L1 data cache is four. With that it's not necessary
to have such aliasing preventions.
Huh? Wow, you really need to write Intel a letter about it wrt their
older hyperthreaded processors! Although, it seems like you simply do
not actually _understand_ what is going on here...
Intel's suggestions for how to mitigate the problem in their earlier
hyperhtreaded processors actually worked wrt improving performance. Keep
in mind this was a while back in 2004-2005. I am happy that the way back
machine has my older code.
Oh, come on, Chris. It's clear that Bonita knows more about what's
going on inside an Intel processor than Intel does.
Chris M. Thomasson
2023-12-31 19:49:41 UTC
Reply
Permalink
Post by Chris M. Thomasson
Post by Bonita Montero
Post by Kaz Kylheku
I explained it. The allocation is not used. When you call alloca(n),
the stack pointer moves by n bytes. If you then call a function,
its stack frame will be offset by that much (plus any alignment if
n is not aligned).
According to the paper Scott mentioned the associativity of the
Pentium 4's L1 data cache is four. With that it's not necessary
to have such aliasing preventions.
Huh? Wow, you really need to write Intel a letter about it wrt their
older hyperthreaded processors! Although, it seems like you simply do
not actually _understand_ what is going on here...
Intel's suggestions for how to mitigate the problem in their earlier
hyperhtreaded processors actually worked wrt improving performance.
Keep in mind this was a while back in 2004-2005. I am happy that the
way back machine has my older code.
Oh, come on, Chris.  It's clear that Bonita knows more about what's
going on inside an Intel processor than Intel does.
Sometimes, I hope it is trolling. Afaict, Bonita seems to be full of
crap to a point where I actually feel sorry for the damn toilets it has
to punish.
Bonita Montero
2023-12-31 05:51:45 UTC
Reply
Permalink
Post by Chris M. Thomasson
Huh? Wow, you really need to write Intel a letter about it wrt their
older hyperthreaded processors! ...
The suggestion Intel made at this point is superfluous.
Post by Chris M. Thomasson
Intel's suggestions for how to mitigate the problem in their earlier
hyperhtreaded processors actually worked wrt improving performance. ...
I'm pretty sure they never ran the numbers on that.
Chris M. Thomasson
2023-12-31 19:36:29 UTC
Reply
Permalink
Post by Bonita Montero
Post by Chris M. Thomasson
Huh? Wow, you really need to write Intel a letter about it wrt their
older hyperthreaded processors! ...
The suggestion Intel made at this point is superfluous.
Post by Chris M. Thomasson
Intel's suggestions for how to mitigate the problem in their earlier
hyperhtreaded processors actually worked wrt improving performance. ...
I'm pretty sure they never ran the numbers on that.
I am pretty sure you are trolling, Bonita?
Bonita Montero
2024-01-01 06:28:36 UTC
Reply
Permalink
Post by Chris M. Thomasson
Post by Bonita Montero
I'm pretty sure they never ran the numbers on that.
I am pretty sure you are trolling, Bonita?
If you allocate memory, the memory usually comes from a pool
where the set indices of the memory block are rather randomized.
It's rather unlikely that you'd get set conflicts between the two
threads to an extent that really hurts performance.
Chris M. Thomasson
2024-01-01 06:53:40 UTC
Reply
Permalink
Post by Bonita Montero
Post by Chris M. Thomasson
Post by Bonita Montero
I'm pretty sure they never ran the numbers on that.
I am pretty sure you are trolling, Bonita?
If you allocate memory the memory usually comes from a pool
where theset indices of the memory block are rather randomized.
It's rather  unlikey that you'd get set conflicts with the both
thread to an extent that this really hurts performance.
Oh my. You really did not read the paper I linked to up thread!
Bonita Montero
2024-01-01 13:11:42 UTC
Reply
Permalink
Post by Chris M. Thomasson
Oh my. You really did not read the paper I linked to up thread!
The suggestion Intel made here is just a Nerd-suggestion and
you're a nerd as well.
Chris M. Thomasson
2024-01-01 23:34:53 UTC
Reply
Permalink
Post by Bonita Montero
Post by Chris M. Thomasson
Oh my. You really did not read the paper I linked to up thread!
The suggestion Intel made here is just a Nerd-suggestion and
you're a nerd as well.
LOL! Wow, you are the king of interesting responses!

:^) Happy New Year!
Bonita Montero
2024-01-02 10:55:54 UTC
Reply
Permalink
Post by Chris M. Thomasson
Post by Bonita Montero
Post by Chris M. Thomasson
Oh my. You really did not read the paper I linked to up thread!
The suggestion Intel made here is just a Nerd-suggestion and
you're a nerd as well.
LOL! Wow, you are the king of interesting responses!
You consider this suggestion to be important because it makes
you feel important.
Chris M. Thomasson
2024-01-02 18:38:07 UTC
Reply
Permalink
Post by Bonita Montero
Post by Chris M. Thomasson
Post by Bonita Montero
Post by Chris M. Thomasson
Oh my. You really did not read the paper I linked to up thread!
The suggestion Intel made here is just a Nerd-suggestion and
you're a nerd as well.
LOL! Wow, you are the king of interesting responses!
You consider this suggestion to be important because it makes
you feel important.
Not true. Don't project on me, oh holy one.

Now, I was reading all of the Intel optimization guides and using VTune
at the time, in the early 2000s. I happened to remember this one (the 64k
aliasing problem), and even found my own C code on the Wayback Machine that
used it. Just because you have not read them does not make me feel
"important". You silly rabbit! Tricks are for kids. :^)
Bonita Montero
2024-01-03 05:48:08 UTC
Reply
Permalink
Post by Chris M. Thomasson
Post by Bonita Montero
Post by Chris M. Thomasson
Post by Bonita Montero
Post by Chris M. Thomasson
Oh my. You really did not read the paper I linked to up thread!
The suggestion Intel made here is just a Nerd-suggestion and
you're a nerd as well.
LOL! Wow, you are the king of interesting responses!
You consider this suggestion to be important because it makes
you feel important.
Not true. Don't project on me, oh holy one.
I know you very well on this point.
Post by Chris M. Thomasson
Now, I was reading all of the Intel optimization guides and using VTune
at the time, early 2000's. I happened to remember this one (64k aliasing
problem), ...
There is no 64kB aliasing problem with the L1-cache since the cache
is smaller.
Chris M. Thomasson
2024-01-03 21:32:12 UTC
Reply
Permalink
Post by Bonita Montero
Post by Chris M. Thomasson
Post by Bonita Montero
Post by Chris M. Thomasson
Post by Bonita Montero
Post by Chris M. Thomasson
Oh my. You really did not read the paper I linked to up thread!
The suggestion Intel made here is just a Nerd-suggestion and
you're a nerd as well.
LOL! Wow, you are the king of interesting responses!
You consider this suggestion to be important because it makes
you feel important.
Not true. Don't project on me, oh holy one.
I know you in this point - very good.
Post by Chris M. Thomasson
Now, I was reading all of the Intel optimization guides and using
VTune at the time, early 2000's. I happened to remember this one (64k
aliasing problem), ...
There is no 64kB aliasing problem with the L1-cache since the cache
is smaller.
Oh god. Have you not read the paper yet? If you have read it all, then
you do not seem to understand it at all. That's on you, Bonita. Like I
said you need to write Intel a letter telling them how that paper never
needed to be written. Lol! You make me laugh!

:^)
Bonita Montero
2024-01-04 03:37:04 UTC
Reply
Permalink
Post by Bonita Montero
There is no 64kB aliasing problem with the L1-cache since the cache
is smaller.
Oh god. Have you not read the paper yet? ...
The Pentium 4's L1 data cache is between 16 and 32kB, so there
can't be 64kB aliasing. And aliasing can only happen on a set basis,
and the sets are 4kB or 8kB large.
Chris M. Thomasson
2024-01-06 03:21:18 UTC
Reply
Permalink
Post by Bonita Montero
Post by Bonita Montero
There is no 64kB aliasing problem with the L1-cache since the cache
is smaller.
Oh god. Have you not read the paper yet? ...
The Pentium 4's L1 data cache is between 16 and 32kB, so there
can't be a 64kB aliasing. And aliasing can be only on a set basis
and the sets are 4kB or 8kB large.
Are you trying to tell me that the aliasing problem on those older Intel
hyperthreaded processors and the workaround (from Intel) was a myth?
lol. ;^)
Bonita Montero
2024-01-06 07:18:35 UTC
Reply
Permalink
Post by Chris M. Thomasson
Are you trying to tell me that the aliasing problem on those older Intel
hyperthreaded processors and the workaround (from Intel) was a myth?
lol. ;^)
Intel just made a nerd-suggestion. With four-way associativity
there's no frequent aliasing problem in the L1 data cache of the
Pentium 4.
Kaz Kylheku
2024-01-06 08:31:06 UTC
Reply
Permalink
Post by Bonita Montero
Post by Chris M. Thomasson
Are you trying to tell me that the aliasing problem on those older Intel
hyperthreaded processors and the workaround (from Intel) was a myth?
lol. ;^)
Intel just made a nerd-suggestion. With four-way associativity
there's no frequent aliasing problem in the L1 data dache of
Pentium 4.
I think the L1 cache was 8K on that thing, and the blocks are 32 bytes.

I think how it works on the P4 is that the address is structured like
this:

  31          11 10                5 4                                0
  |            | |                 | |                                |
  [ 21 bit tag ] [ 6 bit cache set ] [ 5 bit offset into 32 byte block ]

Thus say we have an area of the stack with the address
range nnnnFF80 to nnnnFFFF (128 bytes, 4 x 32 byte cache blocks).

These four blocks all map to the same set: they have the same six
bits in the "cache set" part of the address.

So if a thread is accessing something in all four blocks, it will
completely use that cache set, all by itself.

If any other thread has a similar block in its stack, with the same
cache set ID, it will cause evictions against this thread.

Sure, if each of these threads confines itself to working with just one
cacheline-sized aperture of the stack, it looks better.

You're forgetting that the sets are very small and that groups of four
adjacent 32-byte blocks map to the same set. Touch four adjacent
cache blocks that are aligned on a 128-byte boundary, and you have
hit full occupancy in the cache set corresponding to that block!

(I suspect the references to 64K should not be kilobytes but sets.
The 8K cache has 64 sets.)

In memory, a 128-byte block that is aligned maps to, and precisely covers,
a cache set. If two such blocks have addresses that are equal modulo 8K, they
collide to the same cache set. If one of those blocks is fully present
in the cache, the other must be fully evicted.

It's really easy to see how things can go south under hyperthreading.
If two hyperthreads are working with clashing 128 byte areas that each
want to hog the same cache set, and the core is switching between them
on a fine-grained basis, ... you get the picture.

It's very easy for the memory mapping allocations used for thread
stacks to produce addresses such that the delta between them is a
multiple of 8K.
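To make the arithmetic concrete, here is a small sketch using the cache parameters assumed in this post (8 KB, 32-byte lines, 4-way, hence 64 sets); the two addresses are invented and differ by a multiple of 8 KB.

#include <cstdint>
#include <cstdio>

constexpr std::uint32_t LINE_BYTES = 32;  // 5 offset bits
constexpr std::uint32_t NUM_SETS   = 64;  // 6 set-index bits (8 KB / 32 B / 4 ways)

constexpr std::uint32_t setIndex( std::uint32_t addr )
{
    return (addr / LINE_BYTES) % NUM_SETS;   // i.e. address bits 5..10
}

int main()
{
    // two made-up stack-top addresses whose delta is a multiple of 8 KB
    std::uint32_t stackA = 0x1234FF80u, stackB = 0x5678FF80u;
    std::printf( "set of A = %u, set of B = %u\n",
                 (unsigned)setIndex( stackA ), (unsigned)setIndex( stackB ) );
    // both come out as set 60, so the 128-byte regions at the top of the two
    // stacks compete for the same 4-way set, as described above.
}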
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca
NOTE: If you use Google Groups, I don't see you, unless you're whitelisted.
Bonita Montero
2024-01-06 09:30:25 UTC
Reply
Permalink
Post by Kaz Kylheku
Post by Bonita Montero
Post by Chris M. Thomasson
Are you trying to tell me that the aliasing problem on those older Intel
hyperthreaded processors and the workaround (from Intel) was a myth?
lol. ;^)
Intel just made a nerd-suggestion. With four-way associativity
there's no frequent aliasing problem in the L1 data dache of
Pentium 4.
I think the L1 cache was 8K on that thing, and the blocks are 32 bytes.
I think how it works on the P4 is that the address is structured is like
31 11 10 5 4 0
| | | | | |
[ 21 bit tag ] [ 6 bit cache set ] [ 5 bit offset into 32 bit block ]
Thus say we have an area of the stack with the address
range nnnnFF80 to nnnnFFFF (128 bytes, 4 x 32 byte cache blocks).
These four blocks all map to the same set: they have the same six
bits in the "cache set" part of the address.
So if a thread is accessing something in all four blocks, it will
completely use that cache set, all by itself.
If any other thread has a similar block in its stack, with the same
cache set ID, it will cause evictions against this thread.
Sure, if each of these threads confines itself to working with just one
cacheline-sized aperture of the stack, it looks better.
You're forgetting that the sets are very small and that groups of
adjacent four 32 byte blocks map to the same set. Touch four adjacent
cache blocks that are aligned on a 128 byte boundary, and you have
hit full occupancy in the cache set corresponding to that block!
(I suspect the references to 64K should not be kilobytes but sets.
The 8K cache has 64 sets.)
In memory, 128 byte blocks that is aligned maps to, and precisely covers
a cache set. If two such blocks addresses that are equal modulo 8K, they
collide to the same cache set. If one of those blocks is fully present
in the cache, the other must be fully evicted.
It's really easy to see how things can go south under hyperthreading.
If two hyperthreads are working with clashing 128 byte areas that each
want to hog the same cache set, and the core is switching between them
on a fine-grained basis, ... you get the picture.
It's very easy for the memory mapping allocations used for thread
stacks to produce addresses such tha the delta between them is a
multiple of 8K.
Of course it's easy to intentionally provoke frequent aliasing
with the P4's L1 cache, but actually this doesn't happen often.
Chris M. Thomasson
2024-01-06 21:15:58 UTC
Reply
Permalink
Post by Bonita Montero
Post by Kaz Kylheku
Post by Bonita Montero
Post by Chris M. Thomasson
Are you trying to tell me that the aliasing problem on those older Intel
hyperthreaded processors and the workaround (from Intel) was a myth?
lol. ;^)
Intel just made a nerd-suggestion. With four-way associativity
there's no frequent aliasing problem in the L1 data dache of
Pentium 4.
I think the L1 cache was 8K on that thing, and the blocks are 32 bytes.
I think how it works on the P4 is that the address is structured is like
  31          11 10                5 4                                0
  |            | |                 | |                                |
  [ 21 bit tag ] [ 6 bit cache set ] [ 5 bit offset into 32 bit block ]
Thus say we have an area of the stack with the address
range nnnnFF80 to nnnnFFFF (128 bytes, 4 x 32 byte cache blocks).
These four blocks all map to the same set: they have the same six
bits in the "cache set" part of the address.
So if a thread is accessing something in all four blocks, it will
completely use that cache set, all by itself.
If any other thread has a similar block in its stack, with the same
cache set ID, it will cause evictions against this thread.
Sure, if each of these threads confines itself to working with just one
cacheline-sized aperture of the stack, it looks better.
You're forgetting that the sets are very small and that groups of
adjacent four 32 byte blocks map to the same set. Touch four adjacent
cache blocks that are aligned on a 128 byte boundary, and you have
hit full occupancy in the cache set corresponding to that block!
(I suspect the references to 64K should not be kilobytes but sets.
The 8K cache has 64 sets.)
In memory, 128 byte blocks that is aligned maps to, and precisely covers
a cache set. If two such blocks addresses that are equal modulo 8K, they
collide to the same cache set. If one of those blocks is fully present
in the cache, the other must be fully evicted.
It's really easy to see how things can go south under hyperthreading.
If two hyperthreads are working with clashing 128 byte areas that each
want to hog the same cache set, and the core is switching between them
on a fine-grained basis, ...  you get the picture.
It's very easy for the memory mapping allocations used for thread
stacks to produce addresses such tha the delta between them is a
multiple of 8K.
Of course it's easy to intentionally provoke frequent aliasing
with the P4's L1 cache, but actually this doesn't happen often.
Fwiw, some people were complaining about bad performance using
hyperthreading. Turning it off in the BIOS improved performance. Hence the
paper was written to show them how to vastly improve performance when
hyperthreading was turned on. You call it nerd stuff, and I still cannot
figure out why.
Chris M. Thomasson
2024-01-06 21:19:57 UTC
Reply
Permalink
Post by Chris M. Thomasson
Post by Bonita Montero
Post by Kaz Kylheku
Post by Bonita Montero
Post by Chris M. Thomasson
Are you trying to tell me that the aliasing problem on those older Intel
hyperthreaded processors and the workaround (from Intel) was a myth?
lol. ;^)
Intel just made a nerd-suggestion. With four-way associativity
there's no frequent aliasing problem in the L1 data dache of
Pentium 4.
I think the L1 cache was 8K on that thing, and the blocks are 32 bytes.
I think how it works on the P4 is that the address is structured is like
  31          11 10                5 4                                0
  |            | |                 | |                                |
  [ 21 bit tag ] [ 6 bit cache set ] [ 5 bit offset into 32 bit block ]
Thus say we have an area of the stack with the address
range nnnnFF80 to nnnnFFFF (128 bytes, 4 x 32 byte cache blocks).
These four blocks all map to the same set: they have the same six
bits in the "cache set" part of the address.
So if a thread is accessing something in all four blocks, it will
completely use that cache set, all by itself.
If any other thread has a similar block in its stack, with the same
cache set ID, it will cause evictions against this thread.
Sure, if each of these threads confines itself to working with just one
cacheline-sized aperture of the stack, it looks better.
You're forgetting that the sets are very small and that groups of
adjacent four 32 byte blocks map to the same set. Touch four adjacent
cache blocks that are aligned on a 128 byte boundary, and you have
hit full occupancy in the cache set corresponding to that block!
(I suspect the references to 64K should not be kilobytes but sets.
The 8K cache has 64 sets.)
In memory, 128 byte blocks that is aligned maps to, and precisely covers
a cache set. If two such blocks addresses that are equal modulo 8K, they
collide to the same cache set. If one of those blocks is fully present
in the cache, the other must be fully evicted.
It's really easy to see how things can go south under hyperthreading.
If two hyperthreads are working with clashing 128 byte areas that each
want to hog the same cache set, and the core is switching between them
on a fine-grained basis, ...  you get the picture.
It's very easy for the memory mapping allocations used for thread
stacks to produce addresses such tha the delta between them is a
multiple of 8K.
Of course it's easy to intentionally provoke frequent aliasing
with the P4's L1 cache, but actually this doesn't happen often.
Fwiw, some people were complaining about bad performance using
hyperthreading. Turning it off in bios improved performance. Hence the
paper was written to show them how to vastly improve performance when
hyperthreading was turned on. You call it nerd stuff, and I still cannot
figure out why?
Humm... I can see it now. Bonita works for Intel and received the
complaints... Bonita says shut up you stupid nerds! Humm... ;^o
Bonita Montero
2024-01-07 09:14:58 UTC
Reply
Permalink
Post by Chris M. Thomasson
Humm... I can see it know. Bonita works for Intel and received the
complaints... Bonita says shut up you stupid nerds! Humm... ;^o
Intel made so many bad decisions with the Pentium 4 that the
architecture was dropped and replaced with a descendant of the Pentium
3, whose descendants are still in current Intel CPUs today. In this
respect, not only was the architecture bad, but also the documentation,
because it is certainly very rare that moving the stack of a
thread when threads are running synchronously is of any use.
Bonita Montero
2024-01-07 09:10:02 UTC
Reply
Permalink
Post by Chris M. Thomasson
Fwiw, some people were complaining about bad performance using
hyperthreading. Turning it off in bios improved performance.
Hence the paper was written to show them how to vastly improve
performance when hyperthreading was turned on. You call it nerd
stuff, and I still cannot figure out why?
We were talking about mutual cache flushing, and you don't know
that this was the issue behind switching off hyperthreading.
Chris M. Thomasson
2024-01-07 20:46:17 UTC
Reply
Permalink
Post by Bonita Montero
Post by Chris M. Thomasson
Fwiw, some people were complaining about bad performance using
hyperthreading. Turning it off in bios improved performance.
Hence the  paper was written to show them how to vastly improve
performance when  hyperthreading was turned on. You call it nerd
stuff, and I still cannot figure out why?
We were talking about mutual cache flushing and you don't kow
that this was the issue with switching of hyperthreading.
I know that they had a problem and the provided workaround from Intel
really did help out. I find it amusing that you are trying to tell me
that this is all nerd stuff, as if it does not mean anything. Oh well.
Bonita Montero
2024-01-08 05:48:45 UTC
Reply
Permalink
Post by Chris M. Thomasson
I know that they had a problem and the provided workaround from Intel
really did help out. ...
Absolutely not, not with four way associativity.

Chris M. Thomasson
2023-12-31 19:39:09 UTC
Reply
Permalink
Post by Bonita Montero
Post by Chris M. Thomasson
Huh? Wow, you really need to write Intel a letter about it wrt their
older hyperthreaded processors! ...
The suggestion Intel made at this point is superfluous.
Post by Chris M. Thomasson
Intel's suggestions for how to mitigate the problem in their earlier
hyperhtreaded processors actually worked wrt improving performance. ...
I'm pretty sure they never ran the numbers on that.
I am pretty sure you never read that paper before, even now.
Chris M. Thomasson
2023-12-29 21:52:57 UTC
Reply
Permalink
Post by Bonita Montero
Post by Chris M. Thomasson
The use of alloca to try to get around the problem in their (Intel's)
early hyperthreaded processors was real, and actually helped. It was
in the Intel docs.
I don't believe it.
I actually found an early version of my old code on the Wayback Machine
that does this. Take note of the following function. It did improve
performance way back on those early hyperthreaded processors.

https://web.archive.org/web/20060214112446/http://appcore.home.comcast.net/appcore/src/ac_thread_c.html
___________________
void* AC_CDECL
prv_thread_entry
( void *state )
{
  int ret;
  void *uret;
  ac_thread_t *_this = state;

  ret = pthread_setspecific
        ( g_tls_key,
          _this );
  if ( ret ) { assert( ! ret ); abort(); }

  if ( _this->id < 64 )
  {
    AC_OS_ALLOCA( 2048 * _this->id );
    uret = _this->fp_entry( (void*)_this->state );
  }

  else
  {
    uret = _this->fp_entry( (void*)_this->state );
  }

  return uret;
}
___________________
Bonita Montero
2023-12-27 05:06:28 UTC
Reply
Permalink
Post by Chris M. Thomasson
So, are you familiar with Intel's early hyper threading problem?
There was false sharing between the ...
False sharing can only happen between different cores.