Discussion: Scanning memory forward vs. reverse
Bonita Montero
2024-01-28 10:19:18 UTC
With my thread pool there's some code that looks up a deque in reverse
order with find_if and reverse iterators to check if there's a "this"
pointer inside the deque. I could also have scanned the deque in
forward order, but it's more likely to find a fitting element when
searching from the back first.
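For illustration only, the lookup is along these lines (hypothetical
names, not the actual pool code):

#include <algorithm>
#include <deque>

struct Task { void *owner; }; // hypothetical queue entry

// Scan from the back because a match is more likely near the end.
bool ownsQueuedTask( std::deque<Task> const &q, void const *self )
{
    return std::find_if( q.rbegin(), q.rend(),
        [&]( Task const &t ) { return t.owner == self; } ) != q.rend();
}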
A deque usually consists of a number of linear parts in memory. This
led me to the question whether scanning memory is faster forward or
backward. I tried to test this with the program below:

#include <iostream>
#include <vector>
#include <atomic>
#include <chrono>

using namespace std;
using namespace chrono;

atomic_char aSum; // sink so the compiler can't optimize the sum away

int main()
{
    constexpr size_t GB = 1ull << 30;
    vector<char> vc( GB );
    // Sum every step-th byte between begin and end, print the time taken.
    auto sum = []( auto begin, auto end, ptrdiff_t step )
    {
        auto start = high_resolution_clock::now();
        char sum = 0;
        for( auto p = begin; end - p >= step; sum += *p, p += step );
        ::aSum.store( sum, memory_order_relaxed );
        cout << duration_cast<nanoseconds>( high_resolution_clock::now() -
            start ).count() / 1.0e6 << "ms" << endl;
    };
    constexpr size_t STEP = 100;
    sum( vc.begin(), vc.end(), STEP );   // forward scan
    sum( vc.rbegin(), vc.rend(), STEP ); // reverse scan
}

On my Windows 7950X Zen4 computer, scanning memory in both directions
runs at the same speed. On my Linux 3990X Zen2 computer, scanning
forward is 22% faster. On my small Linux PC, an HP EliteDesk mini PC
with a Skylake Pentium G4400, scanning forward is about 38% faster.
I would have guessed that the prefetchers between the memory levels
are equally effective in both directions. So I'd like to see some
results from you.
Marcel Mueller
2024-01-28 10:32:56 UTC
Post by Bonita Montero
A deque usually consists of a number of linear parts in memory. This
led me to the question whether scanning memory is faster forward or
backward. [...]
On my Windows 7950X Zen4 computer, scanning memory in both directions
runs at the same speed. On my Linux 3990X Zen2 computer, scanning
forward is 22% faster. On my small Linux PC, an HP EliteDesk mini PC
with a Skylake Pentium G4400, scanning forward is about 38% faster.
I would have guessed that the prefetchers between the memory levels
are equally effective in both directions. So I'd like to see some
results from you.
Reverse memory access is typically slower simply because the last data
of a cache line (after a cache miss) arrives last. When reading
forward, execution can continue as soon as the first few bytes of the
cache line have arrived; the remaining data is transferred in parallel.

But the details depend on many other factors, above all the placement
of the memory chunks and the prefetching technique used (if any).
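If prefetching is the limiting factor, one experiment is to add
explicit hints to the reverse scan. A sketch, assuming the GCC/Clang
builtin __builtin_prefetch (not standard C++; the names are invented):

#include <cstddef>

// Sum every step-th byte starting from the end, hinting the cache
// a few iterations ahead of the current position.
char sumReverse( const char *base, std::size_t n, std::size_t step )
{
    if( !n )
        return 0;
    char sum = 0;
    for( std::ptrdiff_t i = (std::ptrdiff_t)(n - 1); i >= 0;
         i -= (std::ptrdiff_t)step )
    {
        if( i >= (std::ptrdiff_t)(8 * step) )
            __builtin_prefetch( base + i - 8 * step, 0, 0 ); // read, no reuse
        sum += base[i];
    }
    return sum;
}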


Marcel
Bonita Montero
2024-01-28 13:07:58 UTC
Post by Marcel Mueller
Reverse memory access is typically slower simply because the
last data of a cache line (after a cache miss) arrives last.
I tested this, and for all offsets within a cacheline I get the
same timing on all three of my computers:

#include <iostream>
#include <vector>
#include <chrono>
#include <atomic>
#include <new> // hardware_constructive_interference_size

using namespace std;
using namespace chrono;

#if defined(__cpp_lib_hardware_interference_size)
constexpr size_t CL_SIZE = hardware_constructive_interference_size;
#else
constexpr size_t CL_SIZE = 64;
#endif

atomic_char aSum; // sink so the compiler can't optimize the sum away

int main()
{
    constexpr size_t
        BLOCK_SIZE = 1ull << 20,
        BLOCKS = BLOCK_SIZE / CL_SIZE,
        ROUNDS = 1000;
    vector<char> block( BLOCK_SIZE );
    // For each offset within a cacheline, touch one byte per line and
    // print the average access time per cacheline in nanoseconds.
    for( size_t offset = 0; offset != CL_SIZE; ++offset )
    {
        auto start = high_resolution_clock::now();
        char sum = 0;
        for( size_t round = ROUNDS; round--; )
            for( size_t i = offset; i < BLOCK_SIZE; sum += block[i], i += CL_SIZE );
        ::aSum.store( sum, memory_order_relaxed );
        cout << offset << ": " <<
            duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count()
            / ((double)BLOCKS * ROUNDS) << endl;
    }
}
Chris M. Thomasson
2024-01-28 19:18:47 UTC
Post by Bonita Montero
Post by Marcel Mueller
Reverse memory access is typically slower simply because the
last data of a cache line (after a cache miss) arrives last.
I tested this, and for all offsets within a cacheline I get the
same timing on all three of my computers:
[...]
    vector<char> block( BLOCK_SIZE );
Try padding and aligning the blocks. iirc, std::vector works with
alignas. Actually, it's pretty nice.
Post by Bonita Montero
[...]
Bonita Montero
2024-01-29 08:56:43 UTC
Post by Chris M. Thomasson
Try padding and aligning the blocks. iirc, std::vector works with
alignas. Actually, it's pretty nice.
I'm testing all 64 offsets. If logical offset zero ends up at physical
offset one within the cacheline, it doesn't matter, since physical
offset zero is then occupied by logical offset 63.
Chris M. Thomasson
2024-01-29 22:12:59 UTC
Post by Bonita Montero
Post by Chris M. Thomasson
Try padding and aligning the blocks. iirc, std::vector works with
alignas. Actually, it's pretty nice.
I'm testing all 64 offsets. If logical offset zero ends up at physical
offset one within the cacheline, it doesn't matter, since physical
offset zero is then occupied by logical offset 63.
You don't want to straddle any cache lines. Properly aligning and
padding the blocks gets around that...
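For what it's worth, one way to get that (a sketch, C++17, assuming
64-byte lines) is an over-aligned element type; the vector's
allocation then goes through aligned operator new:

#include <cstddef>
#include <vector>

constexpr std::size_t CL_SIZE = 64; // assumption: 64-byte cachelines

// Over-aligned element: a vector<CacheLine> starts on a cacheline
// boundary and covers a whole number of lines.
struct alignas(CL_SIZE) CacheLine { char bytes[CL_SIZE]; };

std::vector<CacheLine> makeAlignedBlock( std::size_t lines )
{
    return std::vector<CacheLine>( lines ); // zero-initialized
}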
Bonita Montero
2024-01-30 10:47:48 UTC
Post by Chris M. Thomasson
Post by Bonita Montero
Post by Chris M. Thomasson
Try padding and aligning the blocks. iirc, std::vector works with
alignas. Actually, it's pretty nice.
I'm testing all 64 offsets. If logical offset zero ends up at physical
offset one within the cacheline, it doesn't matter, since physical
offset zero is then occupied by logical offset 63.
You don't want to straddle any cache lines. ...
I'm testing all 64 offsets, and for my measurement it doesn't matter
whether the beginning of the block is at offset zero inside a
cacheline, since the results show equal access times for all offsets.
If there were different results, it might have made sense to have
proper alignment.
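To illustrate the argument: a misaligned base merely permutes which
physical slot within the line each logical offset hits, so sweeping
all 64 offsets measures the same set either way. A toy demonstration
(a misalignment of 1 is an arbitrary example):

#include <cstdio>

int main()
{
    const unsigned CL = 64, misalign = 1; // e.g. block starts 1 byte into a line
    for( unsigned logical = 0; logical < CL; ++logical )
        std::printf( "logical offset %2u -> physical offset %2u\n",
                     logical, (logical + misalign) % CL );
}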
Vir Campestris
2024-01-31 15:56:01 UTC
Post by Bonita Montero
On my Windows 7950X Zen4 computer, scanning memory in both directions
runs at the same speed. On my Linux 3990X Zen2 computer, scanning
forward is 22% faster. On my small Linux PC, an HP EliteDesk mini PC
with a Skylake Pentium G4400, scanning forward is about 38% faster.
I would have guessed that the prefetchers between the memory levels
are equally effective in both directions. So I'd like to see some
results from you.
On my Linux box with an AMD Ryzen 5 3400G, the second number (the
reverse scan) is about 11% slower. But that's a very rough figure:
the machine is doing something else right now, and that's the average
of several runs, with the ratio varying between 97% and 130%.

Andy
