Going lock-free with atomic variables - Cushion for the Pushing
Recap of locking data structures
When working with more than one thread, care needs to be taken to eliminate problems like data races, which can lead to undefined behavior, and deadlocks, which stall the application altogether. These problems are related: we use locks to prevent data races, but in using locks we might introduce a deadlock (see this). Our favorite C++ containers (std::deque, std::stack, etc.) can be wrapped (coarse-grained locking) to obtain thread-safe variants. Although this does not harness much of the available concurrency, it is usually sufficient for the job. If we want to extract more concurrency out of the data structure we can use fine-grained locking. To illustrate the difference, here are two code examples: one implementing push on a thread-safe stack using coarse-grained locking, and another using fine-grained locking.

#include <deque>
#include <memory>
#include <mutex>

template<typename T>
class coarse_stack {
private:
    std::deque<T> s;
    mutable std::mutex m;
public:
    coarse_stack() { s.resize(0); }
    void push(T item) {
        // the entire operation, allocation included, happens under the lock
        std::lock_guard<std::mutex> lk(m);
        s.push_back(std::move(item));
    }
};

template<typename T>
class fine_stack {
private:
    struct node {
        T data;
        std::unique_ptr<node> next; // owning pointer, so the old head survives the swap below
        node(T data_) : data(std::move(data_)) {}
    };
    std::unique_ptr<node> head;
    std::mutex m;
public:
    void push(T item) {
        // allocate the node before taking the lock
        std::unique_ptr<node> new_node = std::make_unique<node>(std::move(item));
        std::lock_guard<std::mutex> lk(m);
        new_node->next = std::move(head); // splice the new node in front of the old head
        head = std::move(new_node);
    }
};
In the first stack we use the standard library's std::deque directly instead of std::stack. std::stack is an adapter that only exposes the methods needed to work with a stack, and its underlying container may pre-allocate storage in chunks, which would give it an unfair advantage over the node-per-push allocation in our vanilla fine-grained variant (any further growth involves linking in additional chunks). If we are to compare performance based on granularity alone, we need to remove this advantage as far as possible; hence the resize to 0 in the constructor. However, if you were implementing this professionally, you would take full advantage of the standard library and use std::stack in the obvious way (with no resize to 0).
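For completeness, that "obvious way" would look something like the following. This is a minimal sketch; the class name locked_stack is just illustrative and is not from the post.

#include <mutex>
#include <stack>

// Coarse-grained wrapper over std::stack: guard every operation with one mutex.
template<typename T>
class locked_stack {
private:
    std::stack<T> s;
    mutable std::mutex m;
public:
    void push(T item) {
        std::lock_guard<std::mutex> lk(m);
        s.push(std::move(item));
    }
};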
The main difference between the coarse-grained and fine-grained implementations is that in the fine-grained version we acquire the lock only after we have allocated the node. Since allocation is an expensive operation and we are now able to execute it concurrently across threads, we should obtain a significant boost in performance. We use std::move to shift ownership of the memory around, as opposed to deep-copying anything.
// thread_count     1     4     7
// coarse         1.50  0.61  0.63
// fine           1.00  0.48  0.47
Looking at the preliminary timings above (see here), where lower is better, we can see that there is an immediate advantage to using a fine-grained locking scheme.
Finally going lock-free
Understanding how we do away with locks requires some knowledge of atomic variables and the operations we can ask of them. At the most basic level, if we make a change to an atomic variable on one thread, other threads can "see" the change; the synchronisation is achieved at a lower level (we will talk about this later). The memory orderings we can impose on atomic operations make them very flexible, but for the time being we won't be fiddling around with that. Let us have a look at the equivalent push operation in a lock-free implementation of a stack:
#include <atomic>

template<typename T>
class lock_free_stack {
private:
    struct node {
        T data;
        node* next;
        node(T const& data_) : data(data_), next(nullptr) {}
    };
    // explicitly initialized: a default-constructed std::atomic is
    // left uninitialized before C++20
    std::atomic<node*> head{nullptr};
public:
    void push(T const& data) {
        node* const new_node = new node(data);
        new_node->next = head.load();                                    // 1
        while (true) {
            if (!head.compare_exchange_weak(new_node->next, new_node)) { // 2
                continue;                                                // 3
            } else {
                break;                                                   // 4
            }
        }
    }
};
What is going on? At 1) we load the current value of head into new_node->next. Looking at the documentation we can see that an atomic load on an atomic pointer is guaranteed to be free of data races. At 2) we compare the current value of head with the value of new_node->next. If they are not equal, then head could have been modified by a different thread; in that case the value supplied as the first parameter (new_node->next) is updated to the current value of head, and at 3) we try again and again... until head is equal to new_node->next, in which case head is set to new_node and we break out of the loop at 4). (Note that compare_exchange_weak is also allowed to fail spuriously, i.e. even when the values are equal, which is another reason it must always live in a retry loop like this.)
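To see it in action, here is a minimal sketch of several threads pushing onto the same stack concurrently. The thread and element counts are just illustrative, not the ones used in the tests below.

#include <thread>
#include <vector>

int main() {
    lock_free_stack<int> stack;
    std::vector<std::thread> threads;

    // four threads push disjoint ranges of integers concurrently;
    // no mutex is involved, every push goes through the CAS loop
    for (int t = 0; t < 4; ++t) {
        threads.emplace_back([&stack, t] {
            for (int i = 0; i < 1000; ++i) {
                stack.push(t * 1000 + i);
            }
        });
    }
    for (auto& th : threads) {
        th.join();
    }
}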
Here are some numbers:
// thread_count     1     4     7
// coarse         1.50  0.61  0.63
// fine           1.00  0.48  0.47
// lock-free      1.40  0.63  0.55
Fine-grained locking still seems to be the winner, but as we increase the thread count above 4 the additional performance gains start to shrink. I am guessing that if we were to carry out the tests on processors with 16+ hardware threads we would see the locked data structures struggle. Neither locked data structure shows much further improvement when the same task is distributed among more than 4 threads; the fine-grained lock does still improve slightly, but even that gain is dwindling. Why is this?
A quick analysis
First, we are not actually carrying out any blocking. If our while loop executes a couple of spins, this would very likely be due to multiple threads attempting to push at the same time. So are we getting true concurrency? No. If we look at my explanation above, there is something quite subtle: "If they are not equal, then head could have been modified by a different thread."
Okay, so how was the current thread made aware of this change (made by that different thread)? The secret lies in the memory ordering we spoke about. By default, our load and compare_exchange_weak use the memory order std::memory_order_seq_cst, that is, sequentially consistent:
All operations using this memory order are ordered to happen once all accesses to memory that may have visible side effects on the other threads involved have already happened.
This is the strictest memory order, guaranteeing the least unexpected side effects between thread interactions through the non-atomic memory accesses.
This means that there is some communication going on between threads behind the scenes, and with it a degree of blocking in some sense: all sequentially consistent operations must form a single total order, and the hardware has to pay for that synchronisation.
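If we wanted to experiment with weaker orderings, both operations accept explicit memory-order arguments. Here is a sketch of what a release/relaxed push might look like; whether this is sufficient depends on how the pop side is written, so treat it as illustrative only, not as the method used in the tests.

// Hypothetical variant of push with explicit memory orderings.
// The release on success publishes the new node's contents to any
// thread that later acquires head; relaxed is enough on failure,
// since we simply retry with the refreshed expected value.
void push_relaxed(T const& data) {
    node* const new_node = new node(data);
    new_node->next = head.load(std::memory_order_relaxed);
    while (!head.compare_exchange_weak(new_node->next, new_node,
                                       std::memory_order_release,
                                       std::memory_order_relaxed)) {
        // the failed CAS wrote the current head into new_node->next; retry
    }
}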
A proper analysis
To do a proper analysis we would have to look at how long the threads are being blocked for in each of our cases. The poor showing of the lock-free push could be due to the granularity of our test task being too large. That is, currently each thread is pushing 210000/num_threads "1000-vectors" (vectors of 1000 elements) onto the stack. Maybe we would see some performance gain by instead pushing 2100000/num_threads "100-vectors" onto the stack?

// thread_count     1     4     7
// coarse         3.79  2.09  2.09
// fine           3.66  1.82  2.10
// lock-free      3.81  2.03  1.79
This confirms my suspicions: the granularity of the tasks was too large. By pushing a larger number of smaller "packets" onto the stack, we can see that the lock-free data structure does seem to shine for larger thread counts. This test case gives the threads far more opportunities to contend for the mutex in the locked variants, so the lock-free improvement shows clearly.
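For reference, here is a sketch of the kind of harness these numbers might come from. The actual test code isn't shown in the post, so the work split, the timing mechanism, and the function name time_pushes below are all assumptions.

#include <chrono>
#include <iostream>
#include <thread>
#include <vector>

// Hypothetical benchmark: total_items vectors of packet_size ints are
// pushed onto one shared stack, split evenly across num_threads.
template<typename Stack>
double time_pushes(Stack& stack, int num_threads, int total_items, int packet_size) {
    std::vector<std::thread> threads;
    auto start = std::chrono::steady_clock::now();
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&stack, total_items, num_threads, packet_size] {
            for (int i = 0; i < total_items / num_threads; ++i) {
                stack.push(std::vector<int>(packet_size)); // one "packet"
            }
        });
    }
    for (auto& th : threads) th.join();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
}

int main() {
    lock_free_stack<std::vector<int>> stack;
    std::cout << time_pushes(stack, 4, 210000, 1000) << "s\n"; // coarse task granularity
    // the finer-granularity run would be time_pushes(stack, 4, 2100000, 100)
}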
So that is it for pushing. Next time we will be pulling.