Using the AVX instruction set to boost performance
The AVX instruction set uses wide CPU registers to perform SIMD (Single Instruction, Multiple Data) operations, which can significantly speed up mathematical code.
I recently decided to create a digital filter in C++, and starting with the basics I came across a Finite Impulse Response (FIR) filter. Mathematically a FIR filter outputs a weighted sum of the most recent input values of the signal. The mathematics of a FIR filter are outlined here.
In practice, we typically work on sections of a signal, quite simply because we cannot hold the entire signal in RAM. Consider the following signal, which has been split into three chunks each of size 6,
$$ [012123][123332][123342] $$
and the following filter $b = [1,2]$.
In the routines discussed in this blog, the argument $data$ is a chunk of signal data, not the entire signal, so for the current example $SIZE = 6$, the length of $data$. In a similar fashion, $FILTERSIZE = 2$.
Following the formula given here we would calculate $$y[1] = b[0]x[1] + b[1]x[0].$$ However, we also need to calculate $$y[0] = b[0]x[0] + b[1]x[-1].$$ This highlights a problem (the fact that we need $x[-1]$), which we solve by setting $x[-1] = 0$. Okay, let's say that we have dealt with the first chunk; then we need $$y[6] = b[0]x[6] + b[1]x[5].$$ That is, we need to cache the last entry from the first chunk of the signal. With a little bit of thought it is easy to see how this generalises to arbitrary length signals and filters.
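For concreteness, plugging the example chunks and $b = [1,2]$ into these formulas gives $$y[0] = 1 \cdot 0 + 2 \cdot 0 = 0, \qquad y[1] = 1 \cdot 1 + 2 \cdot 0 = 1, \qquad y[6] = 1 \cdot 1 + 2 \cdot 3 = 7,$$ where $x[5] = 3$ is the cached last entry of the first chunk and $x[6] = 1$ is the first entry of the second chunk.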
The above formulation applies to real input and real filter coefficients. In the applications that I have encountered we have to read the data in from a binary file which contains interleaved complex pairs of numbers, even though the filter is still real valued. This is the reason behind the strange "jumps by 2". Of course, we could first convert to real and imaginary components and then write back, which will be necessary in the AVX version but not here.
The initial C++ version is given below,
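The following is a minimal sketch of that routine, reconstructed from the description above: only $data$, $SIZE$ and $FILTERSIZE$ are named as in the text, while the function name, the output buffer and the chunk cache `prev` are my own assumptions, and $SIZE \geq FILTERSIZE$ is assumed.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sketch of the "vanilla" FIR filter on one chunk of interleaved complex
// data (re, im, re, im, ...). The filter coefficients b are real, and prev
// caches the last FILTERSIZE - 1 complex samples of the previous chunk
// (zero-filled before the first chunk is processed).
void fir_vanilla(float* data, const float* b,
                 std::size_t SIZE, std::size_t FILTERSIZE,
                 std::vector<float>& prev)
{
    std::vector<float> out(2 * SIZE, 0.0f);

    for (std::size_t n = 0; n < SIZE; ++n) {
        float re = 0.0f, im = 0.0f;
        for (std::size_t k = 0; k < FILTERSIZE; ++k) {
            if (n >= k) {
                // x[n - k] lives in the current chunk; note the jump by 2
                // because real and imaginary parts are interleaved.
                re += b[k] * data[2 * (n - k)];
                im += b[k] * data[2 * (n - k) + 1];
            } else {
                // x[n - k] lives in the cached tail of the previous chunk.
                std::size_t p = prev.size() / 2 - (k - n);
                re += b[k] * prev[2 * p];
                im += b[k] * prev[2 * p + 1];
            }
        }
        out[2 * n]     = re;
        out[2 * n + 1] = im;
    }

    // Cache the tail of this chunk for the next call.
    prev.assign(data + 2 * (SIZE - (FILTERSIZE - 1)), data + 2 * SIZE);

    std::copy(out.begin(), out.end(), data);
}
```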
All the code in this blog will be available in this repo (along with plots and plotting code). For reference I am using a machine with an Intel i7-7500U and both -Ofast and -mavx flags are active. For the test I just execute the filter 1000 times on some data to iron out any particularly good or bad runs.
Looking at some initial benchmarks for our "vanilla" version we obtain the following plot
As we can see, the meat of the filter is the $O(SIZE \times FILTERSIZE)$ inner loop over the filter coefficients. So this is where our optimisation journey begins. We want to use AVX to vectorise calculations. One way in which we can do this is to convert the above "reverse dot product" thing into a "dot product" thing. This is achieved by applying std::reverse to our filter. Since we do not need to modify our filter coefficients once the filter is generated, we can do this for "free" (in the grand scheme of things, at least). This is the code we end up with:
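(Again a sketch with assumed names rather than the original listing; for brevity it takes the previous chunk's tail as already prepended to $data$, so the inner loop is a plain forward dot product.)

```cpp
#include <cstddef>

// Sketch of the reversed-filter version. b_rev holds the coefficients
// after std::reverse, and data is assumed to already have the previous
// chunk's tail (FILTERSIZE - 1 complex samples) prepended, so it holds
// SIZE + FILTERSIZE - 1 interleaved complex samples and out receives
// SIZE complex outputs.
void fir_rev_filter(const float* data, const float* b_rev, float* out,
                    std::size_t SIZE, std::size_t FILTERSIZE)
{
    for (std::size_t n = 0; n < SIZE; ++n) {
        const float* x = data + 2 * n;  // window of FILTERSIZE complex samples
        float re = 0.0f, im = 0.0f;
        for (std::size_t k = 0; k < FILTERSIZE; ++k) {
            // Forward dot product: the filter is already reversed, so the
            // data index only ever moves forwards (stride 2 for re/im).
            re += b_rev[k] * x[2 * k];
            im += b_rev[k] * x[2 * k + 1];
        }
        out[2 * n]     = re;
        out[2 * n + 1] = im;
    }
}
```

The reversal itself happens only once, when the filter is generated, e.g. something like std::reverse(b, b + FILTERSIZE), which is why it is essentially free.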
Running some benchmarks we obtain
It seems as though the compiler likes this. Across the board the code runs faster. I am guessing the compiler "sees" that we are trying to do a "dot product" thing and generates instructions to speed up this process. Since this is an iterative process I am not going to delve into the generated machine code (although I feel like I might in the future). One thing that will make life easier for us during the vectorisation process is to split the data into real and imaginary parts, so let us do that.
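A sketch of that change, under the same padded-window assumptions as before (the helper name and scratch vectors are mine):

```cpp
#include <cstddef>
#include <vector>

// Sketch of the real/imaginary split: deinterleave the padded chunk,
// run the same forward dot product on each component separately, then
// interleave the results back into out.
void fir_split(const float* data, const float* b_rev, float* out,
               std::size_t SIZE, std::size_t FILTERSIZE)
{
    const std::size_t padded = SIZE + FILTERSIZE - 1;
    std::vector<float> re(padded), im(padded);

    for (std::size_t i = 0; i < padded; ++i) {  // deinterleave
        re[i] = data[2 * i];
        im[i] = data[2 * i + 1];
    }

    for (std::size_t n = 0; n < SIZE; ++n) {    // two real-valued dot products
        float r = 0.0f, m = 0.0f;
        for (std::size_t k = 0; k < FILTERSIZE; ++k) {
            r += b_rev[k] * re[n + k];
            m += b_rev[k] * im[n + k];
        }
        out[2 * n]     = r;                     // interleave back on the way out
        out[2 * n + 1] = m;
    }
}
```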
How does this impact performance? To see a plot of the results check here. But if you look at the raw data (some of which is provided at the end of this post) you can see this change increases performance by 2x in most cases! In the form we have written it now, the compiler is more aware that we are trying to compute a "dot product like" thing, or a "multiply and add" thing.
Now we need to think about vectorising the dot product ourselves. We all know how to calculate a single dot product, so I will not discuss that. Instead we need to talk about calculating 8 dot products at once! We broadcast the first filter coefficient into all 8 lanes of a register, multiply 8 data elements by that same value all at once, and store it in a result register.
[x x x x x x x x] x x x (1)
x [x x x x x x x x] x x (2)
x x [x x x x x x x x] x (3)
...
We then broadcast the second filter coefficient, repeat the multiplication on the window shifted along by one element, and add the result to the result register. After we have done this for each of the filter coefficients, the first element in the result register will be the dot product given in (1), the second element in the result register will be the dot product given in (2), and so on. This is implemented in code as follows
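(A minimal sketch of that kernel, operating on one deinterleaved component, real or imaginary; the names are again mine, and $SIZE$ is assumed to be a multiple of 8 to keep remainder handling out of the way.)

```cpp
#include <immintrin.h>
#include <cstddef>

// Sketch of the hand-vectorised kernel: 8 neighbouring outputs are
// accumulated in one 256-bit register. x is the padded, deinterleaved
// input (SIZE + FILTERSIZE - 1 floats) and y receives SIZE outputs.
void fir_avx(const float* x, const float* b_rev, float* y,
             std::size_t SIZE, std::size_t FILTERSIZE)
{
    for (std::size_t n = 0; n < SIZE; n += 8) {
        __m256 acc = _mm256_setzero_ps();              // 8 running dot products
        for (std::size_t k = 0; k < FILTERSIZE; ++k) {
            __m256 coef = _mm256_set1_ps(b_rev[k]);    // broadcast one coefficient
            __m256 vals = _mm256_loadu_ps(x + n + k);  // window shifted by one per k
            acc = _mm256_add_ps(acc, _mm256_mul_ps(coef, vals));
        }
        _mm256_storeu_ps(y + n, acc);                  // y[n] .. y[n + 7]
    }
}
```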
Let us see whether or not we have gained anything from hardcoding AVX intrinsics by looking at some plots.
As we can see, our AVX implementation dominates for larger values of $SIZE$. Looking at a particular case where $SIZE = 80000$ and $FILTERSIZE = 512$, the avx version takes 3.3 seconds and the vanilla version takes 23 seconds. This is quite the boost in performance, although this by itself is not enough to conclude that we have beaten the compiler, as we had manipulated the code a fair bit before hardcoding AVX. Looking at a partial dump of the raw data (an entire up-to-date data dump is given here),
where the columns are ordered [$FILTERSIZE$, $SIZE$, avx, real-imag, rev-filter, vanilla]. The biggest saving, by far, is going from vanilla to rev-filter: in some cases we get a 3x speedup, and across the board we get a 2x speedup. Let's now consider the transition from real-imag to avx. For $FILTERSIZE = 32, 64$ we get approximately a 2x speedup, and for larger filter sizes our gains are in the range 1.3x to almost 2x. While this might come across as a little disappointing, we are still gaining 20 seconds in some cases, bearing in mind that the compiler was already vectorising the code before we hardcoded it.
So we have definitely beaten the compiler. This begs the question: can we go further? The answer is yes. First off, we could use AVX-512. This is not a possibility for me because I do not have a machine that is AVX-512 capable. However, I do have a machine that is capable of Fused Multiply-Add (FMA). All I have to do to enable this feature is add the -mfma flag to the appropriate line in my CMakeLists.txt file. We can also hardcode the FMA intrinsic by making the following change:
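In terms of the sketch above, the separate multiply and add collapse into a single fused intrinsic; this is the only line that changes:

```cpp
// Before: separate multiply and add.
acc = _mm256_add_ps(acc, _mm256_mul_ps(coef, vals));
// After: fused multiply-add, available with -mfma.
acc = _mm256_fmadd_ps(coef, vals, acc);
```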
We get the following results (again, the raw data is available on GitHub):
These results are absolutely fantastic. The rev-filter variant of the FIR filter is faster than the hardcoded avx variant with FMA enabled. With our additional flag we get a serious speed up on the hardcoded avx version. And then with some extra changes in our kernel (changing from EXECUTE to EXECUTE2) we get further speed ups again.
As a final note, I should say that using AVX intrinsics is not always an "add a flag and fly" sort of solution. It can definitely provide speedups when done this way but, as we have seen, strong-arming the compiler by refactoring can provide even greater speedups, and hardcoding can get you even further. Looking at the raw benchmark results we can see that we have achieved an almost 8x speedup going from our original solution to the final FMA/AVX solution in many cases (both with -mavx enabled). This is quite astounding considering we have not even resorted to threads yet.