std::hardware_destructive_interference_size, std::hardware_constructive_interference_size

From cppreference.com

Revision as of 20:27, 28 March 2021

 
 
 
Defined in header <new>
inline constexpr std::size_t
    hardware_destructive_interference_size = /*implementation-defined*/;
(1) (since C++17)
inline constexpr std::size_t
    hardware_constructive_interference_size = /*implementation-defined*/;
(2) (since C++17)
1) Minimum offset between two objects to avoid false sharing. Guaranteed to be at least alignof(std::max_align_t).
struct keep_apart {
  alignas(std::hardware_destructive_interference_size) std::atomic<int> cat;
  alignas(std::hardware_destructive_interference_size) std::atomic<int> dog;
};
2) Maximum size of contiguous memory to promote true sharing. Guaranteed to be at least alignof(std::max_align_t).
struct together {
  std::atomic<int> dog;
  int puppy;
};
struct kennel {
  // Other data members...
  alignas(sizeof(together)) together pack;
  // Other data members...
};
static_assert(sizeof(together) <= std::hardware_constructive_interference_size);

Notes

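These constants provide a portable way to access the L1 data cache line size. They are compile-time values chosen by the implementation for the target, so they may differ from the cache line size of the particular machine the program runs on.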

Example

The program uses two threads that atomically write to the data members of two global objects. In the first object, both data members fit in one cache line, so the threads' writes to different members interfere with each other (false sharing, or "destructive interference"). The second object keeps its data members on separate cache lines, which avoids the cache-coherence traffic that each thread's writes would otherwise cause for the other.

#include <atomic>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <iomanip>
#include <iostream>
#include <mutex>
#include <new>
#include <thread>
 
#ifdef __cpp_lib_hardware_interference_size
    using std::hardware_constructive_interference_size;
    using std::hardware_destructive_interference_size;
#else
    // 64 bytes on x86-64 │ L1_CACHE_BYTES │ L1_CACHE_SHIFT │ __cacheline_aligned │ ...
    constexpr std::size_t hardware_constructive_interference_size
        = 2 * sizeof(std::max_align_t);
    constexpr std::size_t hardware_destructive_interference_size
        = 2 * sizeof(std::max_align_t);
#endif
 
std::mutex cout_mutex;
 
constexpr int max_write_iterations{10'000'000}; // benchmark time tuning
 
struct alignas(hardware_constructive_interference_size)
OneCacheLiner { // occupies one cache line
    std::atomic_uint64_t x{};
    std::atomic_uint64_t y{};
} oneCacheLiner;
 
struct TwoCacheLiner { // occupies two cache lines
    alignas(hardware_destructive_interference_size) std::atomic_uint64_t x{};
    alignas(hardware_destructive_interference_size) std::atomic_uint64_t y{};
} twoCacheLiner;
 
// steady_clock is monotonic, which suits measuring time intervals.
inline auto now() noexcept { return std::chrono::steady_clock::now(); }
 
template<bool xy>
void oneCacheLinerThread() {
    const auto start { now() };
 
    for (std::uint64_t count{}; count != max_write_iterations; ++count)
        if constexpr (xy)
             oneCacheLiner.x.fetch_add(1, std::memory_order_relaxed);
        else oneCacheLiner.y.fetch_add(1, std::memory_order_relaxed);
 
    const std::chrono::duration<double, std::milli> elapsed { now() - start };
    std::lock_guard lk{cout_mutex};
    std::cout << "oneCacheLinerThread() spent " << elapsed.count() << " ms\n";
    // Reuse the counter to pass the elapsed time (truncated to whole ms) back to main().
    if constexpr (xy)
         oneCacheLiner.x = elapsed.count();
    else oneCacheLiner.y = elapsed.count();
}
 
template<bool xy>
void twoCacheLinerThread() {
    const auto start { now() };
 
    for (std::uint64_t count{}; count != max_write_iterations; ++count)
        if constexpr (xy)
             twoCacheLiner.x.fetch_add(1, std::memory_order_relaxed);
        else twoCacheLiner.y.fetch_add(1, std::memory_order_relaxed);
 
    const std::chrono::duration<double, std::milli> elapsed { now() - start };
    std::lock_guard lk{cout_mutex};
    std::cout << "twoCacheLinerThread() spent " << elapsed.count() << " ms\n";
    // Reuse the counter to pass the elapsed time (truncated to whole ms) back to main().
    if constexpr (xy)
         twoCacheLiner.x = elapsed.count();
    else twoCacheLiner.y = elapsed.count();
}
 
int main() {
    std::cout << "__cpp_lib_hardware_interference_size "
#   ifdef __cpp_lib_hardware_interference_size
        " = " << __cpp_lib_hardware_interference_size << "\n";
#   else
        "is not defined\n";
#   endif
 
    std::cout
        << "hardware_destructive_interference_size == "
        << hardware_destructive_interference_size << '\n'
        << "hardware_constructive_interference_size == "
        << hardware_constructive_interference_size << '\n'
        << "sizeof( std::max_align_t ) == " << sizeof(std::max_align_t) << "\n\n";
 
    std::cout
        << std::fixed << std::setprecision(2)
        << "sizeof( OneCacheLiner ) == " << sizeof( OneCacheLiner ) << '\n'
        << "sizeof( TwoCacheLiner ) == " << sizeof( TwoCacheLiner ) << "\n\n";
 
    constexpr int max_runs{4};
 
    int oneCacheLiner_average{0};
    for (auto i{0}; i != max_runs; ++i) {
        std::thread th1{oneCacheLinerThread<0>};
        std::thread th2{oneCacheLinerThread<1>};
        th1.join(); th2.join();
        oneCacheLiner_average += oneCacheLiner.x + oneCacheLiner.y;
    }
    std::cout << "Average time: " << (oneCacheLiner_average / max_runs / 2) << " ms\n\n";
 
    int twoCacheLiner_average{0};
    for (auto i{0}; i != max_runs; ++i) {
        std::thread th1{twoCacheLinerThread<0>};
        std::thread th2{twoCacheLinerThread<1>};
        th1.join(); th2.join();
        twoCacheLiner_average += twoCacheLiner.x + twoCacheLiner.y;
    }
    std::cout << "Average time: " << (twoCacheLiner_average / max_runs / 2) << " ms\n\n";
}

Possible output:

__cpp_lib_hardware_interference_size is not defined
hardware_destructive_interference_size == 64
hardware_constructive_interference_size == 64
sizeof( std::max_align_t ) == 32
 
sizeof( OneCacheLiner ) == 64
sizeof( TwoCacheLiner ) == 128
 
oneCacheLinerThread() spent 275.23 ms
oneCacheLinerThread() spent 330.37 ms
oneCacheLinerThread() spent 320.65 ms
oneCacheLinerThread() spent 389.14 ms
oneCacheLinerThread() spent 388.48 ms
oneCacheLinerThread() spent 448.34 ms
oneCacheLinerThread() spent 420.10 ms
oneCacheLinerThread() spent 459.01 ms
Average time: 378 ms
 
twoCacheLinerThread() spent 123.79 ms
twoCacheLinerThread() spent 130.48 ms
twoCacheLinerThread() spent 119.03 ms
twoCacheLinerThread() spent 132.32 ms
twoCacheLinerThread() spent 116.26 ms
twoCacheLinerThread() spent 122.64 ms
twoCacheLinerThread() spent 116.42 ms
twoCacheLinerThread() spent 128.11 ms
Average time: 123 ms

See also

hardware_concurrency
returns the number of concurrent threads supported by the implementation
(public static member function of std::thread)
hardware_concurrency
returns the number of concurrent threads supported by the implementation
(public static member function of std::jthread)