In this post we'll define the softmax classifier loss function and compute its gradient. We'll work step-by-step starting from scratch. We carry out the calculus required to compute the partial derivatives, write out some Python (and numpy) code based on this, then show how to "vectorize" the code.
We'll start with the softmax function, which is a basic component of the softmax loss function we will define.
The softmax function takes an n-tuple $(x_1, \dots, x_n)$ of real numbers and outputs another n-tuple of values. It's notationally easier to give the definition of $\text{softmax}(x_1,\dots,x_n)$ by saying what each particular entry of the resulting tuple is.
For each $i=1,\dots,n$, the $i$-th coordinate of $\text{softmax}(x_1,\dots,x_n)$ is defined to be
$$ \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}} $$Let's discuss this a bit to get a sense of what it does and where it comes from.
We can view the term above as a composition of basic functions we're more familiar with. In particular, we can view it as a composition of $\exp$ functions: $$x \mapsto e^x$$ and functions we'd typically encounter in basic probability: $$(y_1,\dots,y_n) \mapsto \frac{y_i}{\sum_{j=1}^{n} y_j}$$ We can thus view the softmax function as taking a tuple $(x_1,\dots,x_n)$, running each coordinate through the exponential function to get $(e^{x_1},\dots,e^{x_n})$, and then (separately, for each coordinate) taking the ratio of a coordinate to the sum of all coordinates.
If we think of the coordinates $e^{x_j}$ in the last step as weights, we can interpret the ratio as "computing the probability" of the $i$-th coordinate. If we think of exponentiation as "stretching" the inputs, then we can view the whole softmax function as stretching our inputs and then computing their probability, relative to the other coordinates.
Because of the growth of $\exp$, this exponential stretching has the effect of exaggerating the weight of the largest element.
This helps to understand why we might want to compute the softmax values rather than just the relative probabilities of the original coordinates (viewed as weights). Applying $\exp$ first has the effect of exaggerating the weights: bigger weights get relatively bigger, smaller weights get relatively smaller. If we were to repeatedly apply $\exp$ before computing the ratios, the largest term would eventually dominate, and the resulting probabilities would be $1$ for this largest term, and $0$ for the others.
This last point also explains the "softmax" name. Think of a "max" function as taking the tuple $(x_1,\dots,x_n)$ and returning a tuple $(0,\dots,1,\dots,0)$, where the $1$ is in the coordinate of the largest $x_j$ and all other coordinates are $0$. This can be thought of as returning a tuple of probabilties, with all of the weight assigned to the coordinate of the largest element. Then "softmax" can be thought of as an approximation of this behavior.
An advantage of the "smooth" behavior of softmax over the "max" function just described is that softmax doesn't have sudden jumps in the output when we vary the inputs. That is, for the "max" function above, if one input coordinate becomes larger than another, the output will suddenly jump, for example from $(1,0)$ to $(0,1)$, representing discontinuity. However, the softmax outputs will transition smoothly.
To summarize intuitively: softmax is a nicer version of taking a tuple and sticking a $1$ in the max entry and $0$s elsewhere.
This post was motivated by having to compute a gradient in an assignment from Stanford's CS231n course on visual recognition with neural networks.
In that context, we are training a neural network with 1 hidden layer for a classification task (with C many classes). We are training on batches of $N$ samples at a time. This means that on each forward pass of our network, we return an $N\times C$ matrix, where each row consists of the $C$ many output scores for one of the $N$ many samples. We will denote the $(i,j)$-th coordinate of this matrix (i.e. the score of sample $i$ on the $j$-th class) as
$$s_{i,j}$$For each sample $X_i$, we consider the network's prediction to be the $j$ such that $s_{i,j}$ is maximal (for that fixed $i$). To train the network, the real classes of the input samples are provided. Specifically, for each sample $X_i$ we are provided with the label $y_i$, which is the index $j$ of the correct score. In short, the network has predicted correctly for $X_i$ if $s_{i,y_i}$ is the largest score among the $s_{i,j}$ with that $i$ fixed.
For example, for each input $X_i$ in our batch, we compute a row of scores for each class, say something like $s_i = (0.1, 5, -0.3)$. We can interpret the coordinate with the largest score as the network's prediction; in this case the 2nd coordinate (i.e. index 1). The real label might be "class 0" or "index 0", in which case our network has incorrectly predicted, since the score $s_{i,0} = 0.1$ is not our maximal score in $s_i$.
When our network predicts inaccurately, we'd like to update the parameters to improve it. To do this, we write a loss function to measure the error in our predictions, and we use gradient descent to adjust the parameters based on this loss.
Because we want a loss function that we can use for gradient descent (i.e. need to compute derivatives for), but we're using a "max" function for prediction (which is discontinuous), we decide to use the softmax function above as part of our loss function.
We will also include regularization terms, though this will not be the focus of our discussion.
We'll now define the loss function we use in this context. We will continue to use the notation from this section (such as $s_{i,j}$ and $y_j$).
Because we are training our network in batches, it will be easier to define the loss function in pieces. We'll define each piece and provide an intuitive interpretation as we go.
First, we define the loss $L_i$ due to sample $i$.
$$ L_i = -\log\left(\frac{ e^{s_{i,y_i}} }{ \sum_j e^{s_{i,j}} }\right) \\ $$For an ideal accurate prediction, the score $s_{i,y_i}$ would be large, and the other $s_{i,j}$ would be small. This would correspond to all of the weight being be on the correct class. For these very good predictions, $L_i$ would evaluate to something close to $-\log(1/1)$, which is 0. For incorrect predictions, we would get things more like $-\log(1/3)$, which evaluates to about $1.1$. Worse predictions increase the denominator relative to the numerator, and increase the overall value of $L_i$.
So, we can think of each $L_i$ as measuring the failure to make the correct label's score much larger than the other scores. Minimizing $L_i$ means moving the network toward a network that shifts the relative weighting of scores toward the right labels.
The regularization loss $R$ is just a sum of squares of entries from the weight matrices used in the neural network. While making this portion of the loss $0$ would be nonsensical (it would be the same as all connection weights being 0, i.e. having no connections in the neural network), including this term in the more general loss function has a desirable effect.
Here is the definition: we will mention the new terms below.
$$ R = \text{reg} \frac{1}{2} (\sum (W_{1,i,j}) ^ 2 + \sum (W_{2,i,j}) ^ 2)\\ $$The term $\text{reg}$ is the "regularization parameter", and is just a real number that controls how much focus will be put on this term in the overall loss function. Notice that if $\text{reg} = 0$, then the whole $R$ term is already $0$. The terms $W_{1,i,j}$ and $W_{2,i,j}$ are entries of the corresponding weight matrices. In this network, there is only one hidden layer, and so there are two connection weight matrices for modeling the connections between layers. The sums are over all relevant $i$ and $j$, the number of which depend on the sizes of the layers. This isn't particularly important for our discussion, so we won't go into more detail on this point.
When $R$ is minimized (in conjunction with other loss terms) it has the effect of making the connection weights in our matrix smaller in magnitude. Understanding the impact of this requires some intuition for the inner workings of neural networks. But, roughly, it means that the overall predictions of the network will be less dependent on particular coordinates of the input values, and more influenced by the overall input.
To help understand this point, notice that if a single entry of the first weight matrix was very large relative to the other entries, this would correspond to a single input coordinate having a large impact on the overall computation. This is not always inherently bad, for example if the prediction really does just depend on that particular coordinate (like: decide whether an image has a white pixel in the top left corner). But generally, an extreme case like this would be unlikely to help build a model with good properties. Tuning the regularization parameter helps adapt the model to the classification problem at hand.
We obtain the overall loss function $L$ as the sum of the regularization loss and the mean of the losses due to each sample in our batch:
$$ L = \frac{1}{n} \sum_i L_i + R $$Because this is just a (weighted) sum of the previous loss functions, its interpretation is now fairly straightforward. Minimizing this function corresponds to minimizing the previous functions simultaneously.
As we move into the next section, remember that $L$ (and each $L_i$) is a function of the score matrix $s$ (i.e. its entries $s_{i,j}$) and the correct labels $y_i$.
In this section, we'll push a differential operator through $L$.
Applying gradient descent requires that we have the gradient of $L$ with respect to the parameters of the network. We either need to analytically find the gradient (which we do here), or approximate it. The basic approach to analytically finding the gradient is to compute partial derivatives starting from the loss function (which is a function of things at the "end" of the network), and then use the chain rule to take steps "back" through the network toward the inputs. In doing this, we obtain the gradients for all of the connection weights along the way.
Since our loss function $L$ is a function of the score values $s_{i,j}$, we start by computing each $\frac{\partial L}{\partial s_{i,j}}$. We will spend most of our effort on these terms, just commenting briefly on the next steps for propagating further back into the network.
As a result, we will get an expression for each coordinate of the gradient of our loss function with respect to the score outputs. In following sections, we translate this directly into Python/numpy code, and then show how to "vectorize" the code.
We want to compute $\frac{\partial L}{\partial s_{i,j}}$, so let's start. To avoid clashing indices with the summations, we'll write this as $\frac{\partial L}{\partial s_{a,b}}$. In the work below, we fix these indices $a$ and $b$ of the network's output score matrix.
$$ \frac{\partial L}{\partial s_{a,b}} = \frac{\partial}{\partial s_{a,b}} \left( \frac{1}{N} \sum_i L_i + R \right) $$We will continue, updating the term by applying differentiation rules, and commenting on the rules we use as we go. Understand that we are listing terms implicitly as a chain of equalities.
By linearity of differentiation, we can distribute the partial across the sum, and also through the $\frac{1}{N}$, and into the $\sum_i$.
$$ \frac{1}{N} \sum_i \left( \frac{\partial}{\partial s_{a,b}} L_i \right) + \frac{\partial}{\partial s_{a,b}} R $$Since $R$ does not depend on any $s_{a,b}$, its derivative is 0 and we can drop that term.
$$ \frac{1}{N} \sum_i \left( \frac{\partial}{\partial s_{a,b}} L_i \right) $$Now, $L_i$ only depends on terms $s_{i,j}$ with that fixed $i$, and so only depends on $s_{a,b}$ when $i=a$. This means most of the derivatives in this sum are 0 and can be dropped.
$$ \frac{1}{N} \frac{\partial}{\partial s_{a,b}} L_a $$We've reduced the problem to finding the partial derivative of one of the loss functions (for one sample) with respect to a score for the corresponding sample. We'll expand $L_a$ now so we can continue working.
$$ \frac{1}{N} \frac{\partial}{\partial s_{a,b}} \left( -\log\left(\frac{ e^{s_{a,y_a}} }{ \sum_j e^{s_{a,j}} }\right) \right) $$Again by linearity of differentiation, we can pass through the negative sign. Moreover, we can pull that negative sign out in front of the whole expression. We will put it in the numerator of the fraction out front.
$$ \frac{-1}{N} \frac{\partial}{\partial s_{a,b}} \left( \log\left(\frac{ e^{s_{a,y_a}} }{ \sum_j e^{s_{a,j}} }\right) \right) $$Now, we differentiate the $\log$ function. This involves flipping the fraction inside and applying the chain rule, so we multiply by the derivative of the input.
$$ \frac{-1}{N} \left(\frac{ \sum_j e^{s_{a,j}} }{ e^{s_{a,y_a}} }\right) \frac{\partial}{\partial s_{a,b}} \left(\frac{ e^{s_{a,y_a}} }{ \sum_j e^{s_{a,j}} }\right) $$Now we apply the quotient rule, resulting in the following lengthy expression.
$$ \frac{-1}{N} \left( \left(\frac{ \sum_j e^{s_{a,j}} }{ e^{s_{a,y_a}} }\right) \frac{ \left(\sum_j e^{s_{a,j}}\right) \frac{\partial}{\partial s_{a,b}} e^{s_{a,y_a}} - e^{s_{a,y_a}} \frac{\partial}{\partial s_{a,b}} \left(\sum_j e^{s_{a,j}}\right) }{ \left(\sum_j e^{s_{a,j}}\right)^2 } \right) $$There are several cancellations ahead involving the sums. But first, we will evaluate the right-most derivative. By linearity, we can pass into the sum, and then the only term $e^{s_{a,j}}$ which depends on $e^{s_{a,b}}$ is $e^{s_{a,b}}$ itself. The other terms become 0 and can be dropped. Differentiating the remaining exponential just leaves it as it is, so we get the following.
$$ \frac{-1}{N} \left( \left(\frac{ \sum_j e^{s_{a,j}} }{ e^{s_{a,y_a}} }\right) \frac{ \left(\sum_j e^{s_{a,j}}\right) \frac{\partial}{\partial s_{a,b}} e^{s_{a,y_a}} - e^{s_{a,y_a}} e^{s_{a,b}} }{ \left(\sum_j e^{s_{a,j}}\right)^2 } \right) $$We will leave the remaining partial alone for now, and work on the cancellations possible. We can split the right-side fraction into a difference of two fractions, then distribute the left-side fraction across those two. This lets us cancel several summations and cancel a pair of $e^{s_{a,y_a}}$ terms (but note: the term with the partial operator still-to-be-applied cannot be cancelled). We won't explain each of these steps, but this is the result after much algebra:
$$ \frac{-1}{N} \left( \frac{ \frac{\partial}{\partial s_{a,b}} e^{s_{a,y_a}} }{ e^{s_{a,y_a}} } - \frac{ e^{s_{a,b}} }{ \sum_j e^{s_{a,j}} } \right) $$All that's left is to comment on the remaining partial derivative. The value of this remaining term depends on the value of $y_a$.
If $y_a = b$, then $\frac{\partial}{\partial s_{a,b}} e^{s_{a,y_a}} = e^{s_{a,y_a}}$. In this case, the left fraction reduces to just 1.
Otherwise, when $y_a \neq b$, the partial derivative evaluates to 0, since the term does not depend on $s_{a,b}$. In this case, the fraction reduces to just 0.
So we have two cases:
$$ \frac{\partial L}{\partial s_{a,b}} = \cases{ \frac{-1}{N} \left( 1 - \frac{ e^{s_{a,b}} }{ \sum_j e^{s_{a,j}} } \right) \text{ when } y_a = b \\ \frac{-1}{N} \left( 0 - \frac{ e^{s_{a,b}} }{ \sum_j e^{s_{a,j}} } \right) \text{ when } y_a \neq b } $$This is the form we will use when we write our Python implementation in the next section.
However, there are various rearrangements we could still do. Notably, we can recognize the fraction on the right as the value of the softmax function's $b$-th coordinate when evaluated on $s_a = (s_{a,1},\dots,s_{a,C})$. This means we can write the result as below. Since the index collisions aren't an issue once we rewrite that term and stop using the summation, we've switched back to writing the derivative with respect to $s_{i,j}$. We will use $\text{softmax}_j$ to refer to the $j$-th coordinate of the softmax output.
$$ \frac{\partial L}{\partial s_{i,j}} = \cases{ \frac{-1}{N} \left( 1 - \text{softmax}_j (s_i) \right) \text{ when } y_i = j \\ \frac{-1}{N} \left( 0 - \text{softmax}_j (s_i) \right) \text{ when } y_i \neq j } $$If by chance you're reading along and found yourself here without the exact context, remember that we are working with a loss function that is defined assuming our network is trained on batches of samples. The loss function averages over losses for $N$ many samples. This explains the presence of the $1/N$ term. If we trained on single examples at a time, we would have $N=1$ and the function would simplify even more.
The final representation we gave above shows another desirable property of the softmax function and the loss functions we're using. Namely, after all the differentation, we're left with a very simple function (at least, in terms of the softmax function).
In the next section, we will give a Python implementation to define the gradient using these partial derivatives. The section after that will "vectorize" the code.
First, we will directly fill an $N\times C$ matrix (one entry per $s_{i,j}$) with the partial derivatives, using two for loops.
We loop over $i$ and then $j$ to access the $(i,j)$-th entry of our matrix of partial derivatives, which we name D_loss_D_scores. This matrix represents the gradient of the loss function with respect to the score values output by the neural network. We most directly copy the 2nd to last representation from the previous section, but we will precompute the summations and use the indices $i,j$.
Some reminders and notes for the code below:
exp_scores = np.exp(scores)
sums = np.sum(exp_scores,axis=1)
D_loss_D_scores = np.zeros(scores.shape)
for i in xrange(N): # for each training sample
    for j in xrange(C): # for each class
        if j == y[i]: # when the index is the "correct class" index for the sample
            D_loss_D_scores[i,j] = (-1.0/N) * (1 - (exp_scores[i,j] / sums[i]))
        else: # every other index
            D_loss_D_scores[i,j] = (-1.0/N) * (0 - (exp_scores[i,j] / sums[i]))
This is a fairly straightforward translation from the equations we found above. In the next section, we show how to write this without any for loops.
In this section, we will give a Python / numpy implementation just using vector and matrix operations. We'll do this in two steps.
First we will remove the for loop over j. The first two lines are the same (as they will be for the rest of this discussion). Our change is to the inner for loop.
We will give the code and then comment on the changes.
exp_scores = np.exp(scores)
sums = np.sum(exp_scores,axis=1)
DlossDscores = np.zeros(scores.shape)
for i in xrange(N):
    DlossDscores[i] = (1.0/N) * (exp_scores[i] / sums[i])
    DlossDscores[i,y[i]] -= (1.0/N)
The most notable change is that we've removed the ",j" part from the indices and the for loop. Numpy takes care of replicating the computations across the j indices.
However, giving up access to the particular j means that we can't directly compare it against y[i]. So we have to change the conditional structure of our code. To do so, we've done a few things:
Now we will remove the for loop over i. Again, we give the code and then comment on the changes.
exp_scores = np.exp(scores)
sums = np.sum(exp_scores,axis=1)
DlossDscores = exp_scores / (N * np.matrix(sums).T)
DlossDscores[range(N),y] -= (1.0/N)
We've removed the for loop and the "[i]" indexing from the main assignment. Numpy now handles the computation across both the i and j indices.
We've also done a bit more algebra (moving the N into the denominator) to make things look a little cleaner.
To make these changes possible, we had to convert the sums array to a column vector so that numpy would correctly "broadcast" it across the matrix when we "divide" our matrix by it. Understanding this point requires some familiarity with the way numpy carries out its vector and matrix operations.
We also had to change how we account for the $y_i = j$ case once more, because we don't have a loop giving us direct access to i. This time, we use numpy's convenient index accessing to change just the relevant entries of our gradient. The python function range just lists the values 1,...,N-1 which correspond to all of our i values, and the vector y contains the corresponding values y[i] for each i. So we've effectively just asked numpy to change the entries with indices of the form [i, y[i]]
We've finished the computation and implementation of the gradient calculation for the output layer of our neural network. We will comment on the remaining calculations needed to update a neural network, but won't go into much detail.
To finish computing all the gradients for a neural network, we would now have to continue computing the gradients with respect to the weight matrices and earlier layers. These computations are simpler than (or similar to) the above partial derivatives. In the context of the CS231n assignment, this just involves some matrix multiplications and a slight trick for handling the ReLU step.
Unlike the computations above, the regularization loss affects the gradients of the weight matrices. So, when computing the gradients for the weight matrices, we have to both account for the part of the gradient due to the regularization loss, and also the part of the gradient computed via the chain rule from the score gradient. In a bit more detail, if we have computed the score matrix $s$ as
$$ s = H_1 \cdot W_2 + b_2 $$where $H_1$ is the activations of the hidden layer, $W_2$ is the relevant matrix of connection weights, and $b_2$ is a bias, then
$$ \frac{ \partial L }{ \partial W_2 } = \left( H_1^t \cdot D^\text{Loss}_\text{scores} \right) + \left( D^\text{R}_{W_2} \right) $$where $D^{A}_{B}$ denotes "the gradient of A as a function of B", and R is the regularization loss from earlier in the post. The equality here can be established by considering the chain rule, how the loss function splits as a sum, and that the regularization loss does not depend on the scores.
As another note regarding the implementation in Python / numpy, be mindful that when numpy performs "broadcasting" to conveniently replicate operations across rows or columns, there is an implicit matrix multiplication. Being unaware of this can make applying the chain rule difficult when trying to compute gradients with respect to earlier parts of the network.
For example, if scores is computed as follows when doing a forward pass on a batch of training examples
scores = np.dot(H1, W2) + b2
then numpy uses "broadcasting" to add the array b2 of biases to each row (i.e. each sample's output) of the output matrix. This is equivalent to constructing a matrix whose rows are each a copy of b2. Forming such a matrix can be done by multiplying b2 by an appropriately oriented vector of 1s. This is important to note, because the partial derivative of scores with respect to b2 is not just 1 (as we might mistakenly think by looking at the above Python code as an equation) but rather a vector of 1s whose length depends on the number of rows (i.e. number of samples) we had to make a copy for.
In our context, this means that the relevant chain rule computation for the partial derivatives of the loss with respect to these biases is something like the following.
D_loss_D_b2 = np.dot(np.ones((1,N)), D_loss_D_scores)
The hardest part of the remaining gradient computations is accounting for these vectors, and more generally making sure that matrices and vectors are oriented in the correct way.
We started from the bottom now we're here.