Optimising matrix multiplication by exploiting cache lines
In order to keep hold of some sort of sanity, I had better start working on something interesting. Here is solvant!
Being naive
When working with matrices we usually translate the textbook matrix-matrix multiplication algorithm directly into code. And it works; most of the time we just don't care, so long as we get the result. But if there is one thing I learned from working with AVX (see here), it is that if we can work on data that is contiguous in memory, the compiler will automatically apply vectorised (SIMD) operations.
In summation notation, the matrix multiplication $\mathbf{C} = \mathbf{A}\mathbf{B}$ looks like
$$C_{i,j} = \sum_{k=0}^{K-1} A_{i,k} B_{k,j}.$$
Translating this directly into code gives us
template <typename T, std::size_t R, std::size_t K, std::size_t C>
inline void matrix_prod(const matrix<T, R, K>& a, const matrix<T, K, C>& b,
                        matrix<T, R, C>& c) {
    for (std::size_t i = 0; i < R; i++) {
        const auto ai = a[i];  // obtain row i of a
        auto ci = c[i];        // obtain row i of c
        for (std::size_t j = 0; j < C; j++) {
            T r = 0;
            for (std::size_t k = 0; k < K; k++) {
                r += ai[k] * b[k][j];
            }
            ci[j] = r;
        }
    }
}
Looking at the innermost loop, we can note that indexing b[k][j] performs an unnecessary multiplication on every iteration just to compute the offset into b, so we can attempt an optimisation by walking down column j of b with a pointer instead. We get something like this:
template <typename T, std::size_t R, std::size_t K, std::size_t C>
inline void matrix_prod_2(const matrix<T, R, K>& a, const matrix<T, K, C>& b,
                          matrix<T, R, C>& c) {
    for (std::size_t i = 0; i < R; i++) {
        const auto ai = a[i];  // obtain row i of a
        auto ci = c[i];        // obtain row i of c
        for (std::size_t j = 0; j < C; j++) {
            T r = 0;
            const T* bkj = &b[0][j];  // walk down column j of b with a pointer
            for (std::size_t k = 0; k < K; k++) {
                r += ai[k] * *bkj;
                bkj += C;  // advance one row: b has C entries per row
            }
            ci[j] = r;
        }
    }
}
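The pointer arithmetic above only makes sense if the matrix stores its elements contiguously in row-major order and operator[] hands back a pointer to a row. solvant's matrix class is not shown in this post, so here is a minimal sketch with exactly those properties; it is my assumption about the interface, not the actual implementation, and it is what the later examples here assume too:

#include <cstddef>
#include <vector>

template <typename T, std::size_t R, std::size_t C>
class matrix {
public:
    // element access used by the test program below: A(i, j)
    T& operator()(std::size_t i, std::size_t j) { return m_data[i * C + j]; }
    const T& operator()(std::size_t i, std::size_t j) const { return m_data[i * C + j]; }

    // row access used by the multiplication routines: a[i] yields a pointer to row i
    T* operator[](std::size_t i) { return m_data.data() + i * C; }
    const T* operator[](std::size_t i) const { return m_data.data() + i * C; }

private:
    std::vector<T> m_data = std::vector<T>(R * C);  // contiguous, row-major, zero-initialised
};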
This is the code we are going to use to test our optimisations:

#include <iostream>
#include "solvant/base/matrix.hpp"
#include "solvant/utils/timer.hpp"

int main(int argc, const char* argv[]) {
    solvant::utils::timer t;
    solvant::base::matrix<double, 1000, 1000> A;
    solvant::base::matrix<double, 1000, 1000> B;
    solvant::base::matrix<double, 1000, 1000> C;
    for (int i = 0; i < 1000; i++) {
        for (int j = 0; j < 1000; j++) {
            A(i, j) = 0.1 * i;
            B(i, j) = 0.1 * j;
        }
    }
    auto begin = t.elapsed();
    solvant::base::matrix_prod(A, B, C);
    auto end = t.elapsed();
    std::cout << end - begin << std::endl;
}

When we look at the numbers at the end of this post, we will see that our first optimisation does offer some improvement, but when AVX is enabled we don't really see much gain at all. The reason for this is that the compiler cannot figure out a way to vectorise this code: on every iteration of the inner loop we are requesting a piece of data that lies outside the current cache line. Specifically, when we increment k we are requesting data (the next row of b) that is 8000 bytes away. That means loading a new cache line, possibly evicting values of the matrix a that we want for the next calculation.
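To make that arithmetic concrete, here is a small standalone sketch of how far apart successive values of b[k][j] sit in memory. It assumes the row-major double-precision layout sketched above and a 64-byte cache line, which is typical for x86 but not something the post states:

#include <cstddef>
#include <iostream>

int main() {
    constexpr std::size_t C = 1000;                         // columns of b in the test above
    constexpr std::size_t row_stride = C * sizeof(double);  // bytes between b[k][j] and b[k+1][j]
    constexpr std::size_t cache_line = 64;                  // typical x86 cache line size (assumption)
    std::cout << "row stride: " << row_stride << " bytes ("
              << row_stride / cache_line << " cache lines)\n";
    // prints: row stride: 8000 bytes (125 cache lines)
}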
The next step
Where do we go from here? Well, above we use the innermost loop to calculate a single entry of $\mathbf{C}$. Suppose instead we use it to update an entire row of $\mathbf{C}$, so that we have the following:

template <typename T, std::size_t R, std::size_t K, std::size_t C>
inline void matrix_prod_3(const matrix<T, R, K>& a, const matrix<T, K, C>& b,
                          matrix<T, R, C>& c) {
    for (std::size_t i = 0; i < R; i++) {
        const auto ai = a[i];  // obtain row i of a
        auto ci = c[i];        // obtain row i of c (assumed to start zeroed)
        for (std::size_t k = 0; k < K; k++) {
            const auto bk = b[k];    // obtain row k of b
            const auto aik = ai[k];  // hoist a[i][k] - avoids repeated index arithmetic
            for (std::size_t j = 0; j < C; j++) {
                ci[j] += aik * bk[j];
            }
        }
    }
}

Now all of the data iterated over in the innermost loop is contiguous, and hopefully the compiler will see this and apply SIMD instructions to speed up our code. If we look at our table of results we can see that this is exactly the case.
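Before we get to that table, it is worth checking that the three routines actually agree. This is only a sketch of such a check: it uses the hypothetical matrix interface outlined earlier rather than solvant's real class, and small non-square matrices so that any indexing mistake shows up immediately:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <iostream>

// assumes matrix, matrix_prod, matrix_prod_2 and matrix_prod_3 (as above) are in scope

int main() {
    matrix<double, 4, 5> a;
    matrix<double, 5, 3> b;
    for (std::size_t i = 0; i < 4; i++)
        for (std::size_t k = 0; k < 5; k++) a(i, k) = 0.1 * (i + k);
    for (std::size_t k = 0; k < 5; k++)
        for (std::size_t j = 0; j < 3; j++) b(k, j) = 0.2 * k - 0.1 * j;
    matrix<double, 4, 3> c1, c2, c3;  // zero-initialised; matrix_prod_3 relies on this
    matrix_prod(a, b, c1);
    matrix_prod_2(a, b, c2);
    matrix_prod_3(a, b, c3);
    double max_diff = 0.0;
    for (std::size_t i = 0; i < 4; i++)
        for (std::size_t j = 0; j < 3; j++)
            max_diff = std::max({max_diff, std::abs(c1(i, j) - c2(i, j)),
                                 std::abs(c1(i, j) - c3(i, j))});
    std::cout << "max difference: " << max_diff << std::endl;  // expect (near) zero
}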
No compiler optimisation, no AVX
matrix_prod 7.23437
matrix_prod_2 3.50215
matrix_prod_3 2.24789
Optimisation and AVX
matrix_prod 1.50129
matrix_prod_2 1.43557
matrix_prod_3 0.287679
As we can see, without any compiler optimisations, replacing the unnecessary index multiplications with pointer additions almost doubles the speed of our code. However, once optimisations are switched on, it seems the compiler does this for us anyway. By switching the order of the loops and accumulating partial results for a whole row in the inner loop, we have shaved off a further second even without optimisations turned on; I suspect this is purely because the data touched in the inner loop sits in the same cache lines.
Now for the magic! By forcing the inner loop to work on contiguous data, the compiler (with AVX on) has vectorised the inner loop, giving us roughly a 5x speed-up over matrix_prod_2 with the same flags, and about 25x over the original unoptimised version!