C++’s standard library provides a large collection of generic algorithms that can be used in many different situations. Some examples are std::sort, std::for_each, std::binary_search, std::min, std::max and std::accumulate.

Using these algorithms used to be fairly cumbersome because most of them required a callable object, which meant you had to define a whole new class just to invoke the algorithm. After lambda support was added in C++11, however, using the algorithms became a lot easier.
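To illustrate the difference, here is a small sketch of my own (the names IsEven and count_evens are illustrative, not from any library): counting even numbers with std::count_if, before and after lambdas.

```cpp
#include <algorithm>
#include <vector>

// Pre-C++11: a dedicated function-object class just to pass one predicate.
struct IsEven {
    bool operator()(int x) const { return x % 2 == 0; }
};

// C++11 and later: the predicate can be written inline as a lambda.
int count_evens(const std::vector<int>& v) {
    return static_cast<int>(
        std::count_if(v.begin(), v.end(),
                      [](int x) { return x % 2 == 0; }));
}
```

With lambdas, the predicate lives next to the call site instead of in a separate class definition somewhere else in the file.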

Since these algorithms are so generic, the library implementors can’t make assumptions about how an algorithm is being invoked. For example, they can’t assume that it is safe to parallelize the algorithm, since race conditions may be possible. Similarly, they can’t assume that it is safe to vectorize it within a single thread.

In C++17, the algorithm interfaces were enhanced to accept an optional execution policy. An execution policy tells the library what assumptions it may make so that it can apply certain optimizations.

In particular, four execution policies are defined in the standard:

  • std::execution::seq: This matches the behavior you get when no policy is provided: no additional assumptions are made, so the library will not parallelize or vectorize the algorithm. (One subtle difference from the policy-free overloads: with any execution policy, an exception thrown from the callable terminates the program instead of propagating.)
  • std::execution::par: This policy declares that this invocation of the algorithm is safe from race conditions and deadlocks. Therefore the library is free to parallelize the algorithm (but it is not free to vectorize within the same thread).
  • std::execution::unseq (added in C++20): This policy declares that the invocation of this algorithm can be vectorized. This means that the body of the function can be run for multiple values at the same time in the same thread.
  • std::execution::par_unseq: This policy combines the guarantees provided in par and unseq and should provide the highest level of optimization.
  • Additional implementation-defined policies are allowed by the standard.

It is important to note that the standard does not require that an implementation do anything with these policies. An implementation can simply ignore them and still be compliant. However, at this point GCC, Clang and Microsoft all provide parallelization for at least some algorithms.
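One practical, environment-specific detail worth knowing (not from the post itself): GCC’s libstdc++ implements the parallel algorithms on top of Intel’s Threading Building Blocks, so you typically need TBB installed and linked. The filename below is just a placeholder:

```shell
# GCC/libstdc++: the parallel algorithms delegate to Intel TBB
g++ -std=c++17 -O2 main.cpp -ltbb -o main
```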

How To Use

Recently I was building a basic ray tracer by following the popular guide “Ray Tracing in One Weekend”. As part of that, I wrote a function that had to repeat an expensive calculation for each pixel.

The code would have looked something like this:

std::vector<Color> pixels(num_pixels);
for (size_t i = 0; i < num_pixels; ++i) {
    pixels[i] = expensive_calculation(i);
}

In my situation the function accepted arguments, but it was a pure function with no side effects. That made it a good candidate for parallelization, so I chose to try the execution policies so I wouldn’t have to manage threads manually.

Instead of writing the loop above, I first wrote the code using the std::transform algorithm.

std::vector<Color> pixels(num_pixels); // destination must be pre-sized
std::vector<size_t> indices(num_pixels);
std::iota(indices.begin(), indices.end(), 0); // fill with 0, 1, ...; needs <numeric>

std::transform( // needs <algorithm>
    indices.begin(), indices.end(), pixels.begin(), 
    [](size_t index){
        return expensive_calculation(index);
    }
);

Once this was working, I was able to add an execution policy to make it run in parallel.

std::vector<Color> pixels(num_pixels); // destination must be pre-sized
std::vector<size_t> indices(num_pixels);
std::iota(indices.begin(), indices.end(), 0); // fill with 0, 1, ...; needs <numeric>

std::transform( // needs <algorithm>
    std::execution::par, // <-- needs <execution>
    indices.begin(), indices.end(), pixels.begin(), 
    [](size_t index){
        return expensive_calculation(index);
    }
);

Results

You can find my actual code on GitHub. Note that Microsoft’s implementation behaves the same whether you use std::execution::par or std::execution::par_unseq, as documented here.

Before adding the parallel execution policy my code took 3.5 seconds. After adding it, the runtime was down to 0.7 seconds (700 ms). A 5x performance improvement is very respectable considering how little effort I had to put in to parallelize it. Additionally, using an execution policy keeps the code much simpler than managing your own threads, so the long-term maintainability of the codebase remains high as well.
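For reference, this kind of before/after measurement can be made with a small std::chrono helper. This is a sketch of my own, not the benchmark code from the actual project:

```cpp
#include <chrono>
#include <utility>

// Returns wall-clock milliseconds spent executing f().
template <typename F>
long long time_ms(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<F>(f)();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(end - start)
        .count();
}

// Usage: time the same std::transform call with and without a policy, e.g.
//   auto serial_ms = time_ms([&] { /* std::transform(indices.begin(), ...) */ });
//   auto par_ms    = time_ms([&] { /* std::transform(std::execution::par, ...) */ });
```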

Overall, I think it’s a great tool to have for situations where parallelization is safe.