The magic behind deep learning: Obtaining gradients using reverse-mode Automatic Differentiation.
Every now and again a wave of rage washes over me when I hear someone (be it a news outlet or otherwise) refer to deep learning, and the like, as magic. I must admit, though, that I was a bit vague on how you can generate an error function given a network of nodes and then minimise it without using numerical differentiation or providing the actual derivative (although there are (totally) derivative-free minimisation methods, such as the Nelder-Mead simplex method). The process of writing software for generating this error function (and a neural network) is not my concern here, although it might be in a future blog post. What does concern me is obtaining the gradient of this error function.
Automatic Differentiation
First, let's discuss the process of obtaining the derivative of an expression for which we already have the computation graph. You will see that obtaining the computation graph is half the battle, and many of the resources on the internet explaining Automatic Differentiation (AD) tend to brush this under the carpet. Let's say we have $f(x,y) = (x+y) + x(x+y)$. Note that:
- $z_5 = f = z_3 + z_4 $
- $z_3 = z_2 + z_1$
- $z_4 = z_1 z_3$
- $z_1 = x$
- $z_2 = y$
Each node in the graph has a cumulative gradient $\tilde{x}$ associated with it. At the end of the process the $x$ and $y$ nodes will have $f_x$ and $f_y$ stored in $\tilde{x}$ respectively, where $f_x$ means the partial derivative of $f$ with respect to $x$.
The process of obtaining gradients is as follows:
- We start at the top node and set its cumulative gradient to 1. (All other nodes have theirs set to 0 at initialisation time).
- For each node we calculate $f_{z_j} = \sum_{i \text{ parent of } j} f_{z_i} \, ({z_i})_{z_j}$, where the local derivative $({z_i})_{z_j}$ depends on the operation at $z_i$:
- If $z_i = z_j + z_k$ then $({z_i})_{z_j} = 1$
- If $z_i = z_j z_k$ then $({z_i})_{z_j} = z_k$ (i.e. the value of the other child)
- If $z_i = z_j / z_k$ then $({z_i})_{z_j} = 1/z_k$
- ... and so on for the other operations.
- In our example we start at $z_5$ and set $z_5.\tilde{x}$ to 1.
- Then $z_3.\tilde{x} \mathrel{+}= z_5.\tilde{x}$ and $z_4.\tilde{x} \mathrel{+}= z_5.\tilde{x}$, since $z_5$ is a sum and both local derivatives are 1. That is, we have added the single components of the sums $f_{z_3}, f_{z_4}$ that involve the parent $z_5$.
- Repeat for the lower nodes in the graph. For example, when processing $z_4 = z_1 z_3$ we add $z_1.\tilde{x} \mathrel{+}= z_4.\tilde{x}\, z_3[state]$ and $z_3.\tilde{x} \mathrel{+}= z_4.\tilde{x}\, z_1[state]$, where $z[state]$ denotes the value of a node at the state we are differentiating at. A code sketch of the full procedure is given below.
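To make the procedure concrete, here is a minimal sketch in C++ (not the prsr or qsts source; the Node layout and function names are invented for illustration) that builds the graph above by hand, runs a forward pass to fill in the node values, and then performs the reverse sweep:

```cpp
// A minimal sketch of the reverse pass described above, not the prsr/qsts
// source: a hand-built graph for f(x,y) = (x+y) + x(x+y), a forward pass to
// fill in node values, and a reverse sweep accumulating the cumulative
// gradients (the x-tilde of each node).
#include <cstdio>
#include <vector>

struct Node {
    char op;                    // '+', '*' or 'v' for a variable/leaf
    Node *left = nullptr, *right = nullptr;
    double value = 0.0;         // value at the current state
    double grad = 0.0;          // cumulative gradient, i.e. x-tilde
};

// Forward pass: compute the value of every node for the current leaf values.
// Shared sub-expressions are simply recomputed; the result is the same.
double eval(Node *n) {
    if (n->op == 'v') return n->value;
    double l = eval(n->left), r = eval(n->right);
    return n->value = (n->op == '+') ? l + r : l * r;
}

// Reverse pass over nodes listed parents-before-children (topological order).
// Each parent pushes grad * (local derivative) into its children.
void backward(const std::vector<Node*> &topo) {
    topo.front()->grad = 1.0;                    // seed the top node with 1
    for (Node *n : topo) {
        if (n->op == '+') {                      // d(n)/d(child) = 1
            n->left->grad  += n->grad;
            n->right->grad += n->grad;
        } else if (n->op == '*') {               // d(n)/d(child) = other child
            n->left->grad  += n->grad * n->right->value;
            n->right->grad += n->grad * n->left->value;
        }
    }
}

int main() {
    Node x{'v'}, y{'v'};
    x.value = 2.0; y.value = 3.0;                // the "state" we evaluate at
    Node z3{'+', &x, &y};                        // z3 = x + y (a shared node)
    Node z4{'*', &x, &z3};                       // z4 = x * z3
    Node z5{'+', &z3, &z4};                      // z5 = f
    eval(&z5);
    backward({&z5, &z4, &z3, &x, &y});
    std::printf("f = %g, f_x = %g, f_y = %g\n", z5.value, x.grad, y.grad);
}
```

At the state $(x,y) = (2,3)$ this prints $f = 15$, $f_x = 8$ and $f_y = 3$. Note that $x$ and $z_3$ each receive contributions from more than one parent, which is exactly why each variable must be represented by a single shared node; that is the subject of the next section.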
So that was fairly easy, right? Now let us look at generating the computation graph.
Generating the expression graph
This, for me, was by far the most interesting part of the problem. A few months ago I started work on a project called prsr. For the same equation above this would generate a graph like
then would find distinct variables and obtain this
The arrows here represent some indirection, sort of: nodes are represented by pointers, and I am redirecting nodes with the same label to point to the same area of memory, so that when the above algorithm works on $x$ it has three parents instead of one.
First, how did I generate the graph? I used a very convoluted recursive algorithm (that involved coroutines) to parse arithmetic expressions in infix notation (e.g. $x+y$) directly into a computation graph. Incredibly ugly. I will show how I simplified this part of the pipeline later in the post, but for people who want to look at this hideous monster it lies here.
Second, why is the first graph wrong? Simply put, the AD procedure thinks that each instance of $x$ is a distinct variable. To put it another way, it thinks that $f(x,y) = (x+y) + x(x+y)$ and $g(a,b,c,d,e) = (a+b) + c(d+e)$ are the "same". So I had to thread the variables together. If you want to see this in code, look here at the method fixExpressionGraph. Although I would rather you didn't: just like the generation of the graph, it is not nice.
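The idea behind that threading step is roughly the following (a hypothetical sketch, not the actual fixExpressionGraph): walk the tree and redirect every variable node with a given label to a single canonical node.

```cpp
// A hypothetical sketch of the variable-threading step, not the actual
// fixExpressionGraph: walk the tree and redirect every variable node with a
// given label to one canonical node, so that "x" appears exactly once.
#include <map>
#include <string>

struct Node {
    std::string label;                       // "x", "y", "+", "*", ...
    bool isVariable = false;
    Node *left = nullptr, *right = nullptr;
};

// n is passed by reference to a pointer so that redirecting it also updates
// the parent's child pointer.
void thread(Node *&n, std::map<std::string, Node*> &seen) {
    if (!n) return;
    if (n->isVariable) {
        Node *&canonical = seen[n->label];
        if (!canonical) canonical = n;       // first occurrence becomes canonical
        else n = canonical;                  // later occurrences are redirected
    }
    thread(n->left, seen);
    thread(n->right, seen);
}
```

Note that this only merges the leaves: duplicated sub-expressions such as $(x+y)$ still appear twice, which is where the memory savings discussed below come from.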
Improvements
So when designing qsts (note that qsts is prsr +1 for each letter, thanks RG ❤) I wanted to efficiently generate efficient graphs in an elegant way. Yes, you read that correctly. More specifically, qsts generates the graph in the first figure (with all the $z$'s). The pipeline can be split up into the following stages:
- Parsing the input to a postfix structure.
- Parsing the postfix structure into an expression.
The first stage is well documented and, with a knowledge of infix (e.g. $x+y$) and postfix (e.g. $xy+$) formats, you can easily put together some code to achieve it, for example along the lines sketched below.
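Here is one way stage one could look: a generic shunting-yard pass over single-character tokens, written for this post rather than taken from qsts.

```cpp
// A minimal sketch of stage one (infix to postfix), assuming single-character
// variable names and the four basic operators; not taken from qsts.
#include <cctype>
#include <map>
#include <string>
#include <vector>

std::vector<char> toPostfix(const std::string &infix) {
    const std::map<char, int> prec = {{'+', 1}, {'-', 1}, {'*', 2}, {'/', 2}};
    std::vector<char> output, ops;
    for (char c : infix) {
        if (std::isalnum(static_cast<unsigned char>(c))) {
            output.push_back(c);                 // operands go straight out
        } else if (c == '(') {
            ops.push_back(c);
        } else if (c == ')') {
            while (ops.back() != '(') {          // flush until the matching '('
                output.push_back(ops.back());
                ops.pop_back();
            }
            ops.pop_back();                      // discard the '('
        } else if (prec.count(c)) {
            // pop operators of greater or equal precedence (left-associative)
            while (!ops.empty() && ops.back() != '(' &&
                   prec.at(ops.back()) >= prec.at(c)) {
                output.push_back(ops.back());
                ops.pop_back();
            }
            ops.push_back(c);
        }                                        // anything else (spaces) is skipped
    }
    while (!ops.empty()) {                       // flush the remaining operators
        output.push_back(ops.back());
        ops.pop_back();
    }
    return output;                               // e.g. "x+y" -> { 'x', 'y', '+' }
}
```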
The second stage is the interesting one, as we need to maintain the uniqueness of nodes as they go in. The constructor of expression achieves this by building the graph off a unique stack: a stack-like container that, when pushed a node it has already seen, hands back the existing pointer instead of storing a duplicate.
What makes this work is the lexicographic ordering imposed on the node objects (essentially mapping 3-space to 1-space), which can be seen in the source code. Essentially, each node considers itself as a point in $\mathbb{R}^3$: the first component is the node itself and the last two are its left and right children. For example, a "$+$" node with children "$x$" and "$y$" is considered as $(+,x,y)$. A sketch of the whole arrangement follows.
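Putting the two ideas together, here is a minimal sketch of what the unique stack and the constructor could look like. The names UniqueStack, buildExpression and so on are made up for this post and are not the qsts identifiers.

```cpp
// A hypothetical sketch of stage two, not the actual qsts constructor: nodes
// are ordered lexicographically as (op, left, right) and kept in a set, so
// pushing a node that already exists returns the existing pointer. Building
// the graph from postfix then automatically shares identical sub-expressions.
#include <cstddef>
#include <functional>
#include <iostream>
#include <set>
#include <vector>

struct Node {
    char op;                                 // variable name or operator symbol
    Node *left, *right;                      // null for leaves
};

// The "unique stack": an ordinary stack of nodes plus a set that enforces
// uniqueness via the (op, left, right) lexicographic ordering. Comparing
// children by pointer is enough precisely because children are already unique.
class UniqueStack {
    struct Less {
        bool operator()(const Node *a, const Node *b) const {
            if (a->op != b->op) return a->op < b->op;
            if (a->left != b->left) return std::less<const Node*>()(a->left, b->left);
            return std::less<const Node*>()(a->right, b->right);
        }
    };
    std::set<Node*, Less> seen;
    std::vector<Node*> stack;
public:
    // Push a node described by (op, left, right); reuse it if already present.
    // (Nodes are deliberately never freed in this sketch.)
    Node *push(char op, Node *left = nullptr, Node *right = nullptr) {
        Node probe{op, left, right};
        auto it = seen.find(&probe);
        Node *node = (it != seen.end()) ? *it
                                        : *seen.insert(new Node{op, left, right}).first;
        stack.push_back(node);
        return node;
    }
    Node *pop() { Node *n = stack.back(); stack.pop_back(); return n; }
    std::size_t size() const { return seen.size(); }
};

// Stage two: postfix tokens in, expression graph with unique nodes out.
Node *buildExpression(const std::vector<char> &postfix, UniqueStack &us) {
    for (char tok : postfix) {
        if (tok == '+' || tok == '-' || tok == '*' || tok == '/') {
            Node *right = us.pop();
            Node *left  = us.pop();
            us.push(tok, left, right);
        } else {
            us.push(tok);                    // variable or constant leaf
        }
    }
    return us.pop();                         // the root, e.g. z5
}

int main() {
    // (x+y) + x*(x+y) in postfix; both (x+y) subtrees collapse to one node.
    std::vector<char> postfix = {'x', 'y', '+', 'x', 'x', 'y', '+', '*', '+'};
    UniqueStack us;
    buildExpression(postfix, us);
    std::cout << "distinct nodes: " << us.size() << "\n";   // prints 5 (z1..z5)
}
```

Feeding this the postfix form of $(x+y) + x(x+y)$ (for instance the output of the toPostfix sketch above) produces exactly five distinct nodes, $z_1$ to $z_5$.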
Now, remember the function $f$ we looked at at the beginning? It can also be written as $f(x,y) = (x + 1)(x+y)$. It is important to note that the computation graphs for these two expressions are different, even though they describe the same function. The reason I wrote it as $(x+y) + x(x+y)$ is that this form highlights the unique-node aspect and generates sufficiently small graphs for a blog post.
Of course, generating a graph with unique nodes comes with some performance hit. But the way in which qsts generates graphs requires approximately half the memory that prsr requires. We have halved the memory needed for the procedure, and once the graph has been generated for a given expression we can evaluate the gradient at as many states as we like. With fewer nodes the graph is also cheaper to reset, and resetting the value of every node happens on each subsequent grad call on a previously generated expression.
While there is indeed a performance loss in generating the graph itself, this is more than made up for in typical use cases. For example, AD is used in gradient descent, which requires us to continually reset and re-evaluate the graph at different states. With unique nodes this process is roughly twice as efficient, as in the example below:
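As an illustration of that loop, here is a self-contained sketch that repeats the reset/forward/reverse cycle inside a plain gradient-descent loop. It reuses the same hand-rolled Node, eval and backward helpers as the first sketch and minimises the toy objective $f(x,y) = x^2 + y^2$; none of this is the qsts API.

```cpp
// A self-contained sketch of the gradient-descent use case (not the qsts API):
// the graph for f(x,y) = x*x + y*y is built once, and every iteration resets
// the stored gradients and re-runs the forward and reverse sweeps.
#include <cstdio>
#include <vector>

struct Node {
    char op;                                    // '+', '*' or 'v' for a variable
    Node *left = nullptr, *right = nullptr;
    double value = 0.0, grad = 0.0;
};

double eval(Node *n) {
    if (n->op == 'v') return n->value;
    double l = eval(n->left), r = eval(n->right);
    return n->value = (n->op == '+') ? l + r : l * r;
}

void backward(const std::vector<Node*> &topo) {
    topo.front()->grad = 1.0;                   // seed the top node
    for (Node *n : topo) {
        if (n->op == '+') { n->left->grad += n->grad; n->right->grad += n->grad; }
        else if (n->op == '*') {
            n->left->grad  += n->grad * n->right->value;
            n->right->grad += n->grad * n->left->value;
        }
    }
}

int main() {
    Node x{'v'}, y{'v'};
    Node xx{'*', &x, &x}, yy{'*', &y, &y};      // shared x and y leaves
    Node f{'+', &xx, &yy};                      // f = x*x + y*y
    std::vector<Node*> topo = {&f, &xx, &yy, &x, &y};
    x.value = 4.0; y.value = -3.0;              // starting state
    const double rate = 0.1;
    for (int step = 0; step < 50; ++step) {
        for (Node *n : topo) n->grad = 0.0;     // reset: the per-iteration cost
        eval(&f);                               // forward pass at the new state
        backward(topo);                         // reverse sweep
        x.value -= rate * x.grad;               // gradient-descent update
        y.value -= rate * y.grad;
    }
    std::printf("minimum near (%g, %g), f = %g\n", x.value, y.value, eval(&f));
}
```

The per-iteration cost is dominated by the reset and the two sweeps over the graph, so halving the number of nodes roughly halves the work done on every step.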