Categorical Cross Entropy


Deriving the gradient of categorical cross entropy with respect to the logits is straightforward, and working through it helps in understanding the loss function.

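Let $K$ be the number of categories, $t_j$ the target for category $j$, $z_j$ the corresponding logit, and $\hat{y_j}$ the softmax output for category $j$. The loss is

$$ L = - \sum_{j=1}^K t_j \log \hat{y_j} $$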
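
As a concrete illustration of the definition, here is a minimal NumPy sketch. It assumes the usual form of the loss, $-\sum_j t_j \log \hat{y_j}$ with the natural logarithm and softmax outputs. The function names, the `eps` guard, and the example logits are illustrative choices, not part of the derivation.

```python
import numpy as np

def softmax(z):
    # Subtracting the maximum does not change the result but avoids overflow.
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

def categorical_cross_entropy(t, y_hat, eps=1e-12):
    # eps guards against log(0); it is an illustrative safeguard only.
    return -np.sum(t * np.log(y_hat + eps))

z = np.array([2.0, 1.0, 0.1])   # hypothetical logits for K = 3 categories
t = np.array([1.0, 0.0, 0.0])   # one-hot target selecting category 0
y_hat = softmax(z)
loss = categorical_cross_entropy(t, y_hat)
print(y_hat, loss)
```

With a one-hot target, the sum collapses to a single term, so the loss is just the negative log of the probability assigned to the correct category.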

Using the chain rule, we can write the partial derivative with respect to the logit $z_i$ as follows:

$$ \frac{\delta L}{\delta z_i} = \sum_j \frac{\delta L}{\delta \hat{y_j}} \frac{\delta \hat{y_j}}{\delta z_i}. $$

We can substitute $\frac{\delta L}{\delta \hat{y_j}}$:

$$ \frac{\delta L}{\delta \hat{y_j}} = \frac{\delta \left (- \sum_{k=1}^K t_k \log \hat{y_k} \right)}{\delta \hat{y_j}} = - t_j \frac{1}{\hat{y_j}}. $$

Then $\frac{\delta L}{\delta z_i}$ becomes

$$ \frac{\delta L}{\delta z_i} = \sum_j - t_j \frac{1}{\hat{y_j}} \frac{\delta \hat{y_j}}{\delta z_i} $$

We can also substitute $\frac{\delta \hat{y_j}}{\delta z_i}$, treating the cases $i = j$ and $i \neq j$ separately. Since $\hat{y_j}$ is the softmax output,

$$ \frac{\delta \hat{y_j}}{\delta z_i} = \frac{\delta \left (\frac{e^{z_j}}{\sum_{k=1}^K e^{z_k}} \right )}{\delta z_i}. $$

If $i = j$, the quotient rule gives

$$ \frac{\delta \hat{y_i}}{\delta z_i} = \frac{\delta \left (\frac{e^{z_i}}{\sum_{k=1}^K e^{z_k}} \right )}{\delta z_i} = \frac{e^{z_i}\sum_{k=1}^K e^{z_k} - e^{z_i} e^{z_i}}{\left (\sum_{k=1}^K e^{z_k} \right )^2} = \frac{e^{z_i}\sum_{\substack{k=1 \\ k \neq i}}^K e^{z_k}}{\left (\sum_{k=1}^K e^{z_k} \right )^2} = \frac{\sum_{\substack{k=1 \\ k \neq i}}^K e^{z_k}}{\left (\sum_{k=1}^K e^{z_k} \right )} \frac{e^{z_i}}{\left (\sum_{k=1}^K e^{z_k} \right )} = \left ( 1- \frac{e^{z_i}}{\left (\sum_{k=1}^K e^{z_k} \right )} \right ) \frac{e^{z_i}}{\left (\sum_{k=1}^K e^{z_k} \right )}, $$

$$ \frac{\delta \hat{y_i}}{\delta z_i} = \hat{y_i} (1-\hat{y_i}). $$

Otherwise, if $i \neq j$, the numerator $e^{z_j}$ does not depend on $z_i$ and it becomes

$$ \frac{\delta \hat{y_j}}{\delta z_i} = \frac{\delta \left (\frac{e^{z_j}}{\sum_{k=1}^K e^{z_k}} \right )}{\delta z_i} = - \frac{e^{z_j} e^{z_i}}{\left ( \sum_{k=1}^K e^{z_k} \right )^2} = - \hat{y_j} \hat{y_i}. $$
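
The two cases can be collected into a single matrix with entries $\hat{y_i}(1-\hat{y_i})$ on the diagonal and $-\hat{y_j}\hat{y_i}$ off the diagonal. The sketch below builds that matrix and compares it against central finite differences as a sanity check. The function names, the step size `h`, and the example logits are assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

def softmax_jacobian(z):
    # J[j, i] = d y_hat_j / d z_i:
    #   diagonal     -> y_hat_i * (1 - y_hat_i)
    #   off-diagonal -> -y_hat_j * y_hat_i
    y_hat = softmax(z)
    return np.diag(y_hat) - np.outer(y_hat, y_hat)

def numerical_jacobian(z, h=1e-6):
    # Central differences, one column per perturbed logit z_i.
    K = z.size
    J = np.zeros((K, K))
    for i in range(K):
        step = np.zeros(K)
        step[i] = h
        J[:, i] = (softmax(z + step) - softmax(z - step)) / (2 * h)
    return J

z = np.array([0.5, -1.2, 2.0, 0.3])   # hypothetical logits
J_analytic = softmax_jacobian(z)
J_numeric = numerical_jacobian(z)
max_error = np.max(np.abs(J_analytic - J_numeric))
print(max_error)
```

Each column of the matrix sums to zero, which reflects the fact that the softmax outputs always sum to one regardless of how a single logit changes.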

Going back to the initial statement, splitting the sum into the $j = i$ term and the remaining terms, and substituting $\frac{\delta \hat{y_j}}{\delta z_i}$ gives

$$ \frac{\delta L}{\delta z_i} = \frac{-t_i}{\hat{y_i}}\frac{\delta \hat{y_i}}{\delta z_i} + \sum_{\substack{j=1 \\ j \neq i}}^K \frac{-t_j}{\hat{y_j}} \frac{\delta \hat{y_j}}{\delta z_i} $$

$$ \frac{\delta L}{\delta z_i} = \frac{-t_i}{\hat{y_i}}\hat{y_i}(1-\hat{y_i}) + \sum_{\substack{j=1 \\ j \neq i}}^K \left ( \frac{-t_j}{\hat{y_j}} \right ) (-\hat{y_j} \hat{y_i}) $$

$$ \frac{\delta L}{\delta z_i} = - t_i + t_i \hat{y_i} + \sum_{\substack{j=1 \\ j \neq i}}^K t_j \hat{y_i} $$

Since the targets sum to one, $\sum_{k=1}^K t_k = 1$, we finally end up with

$$ \frac{\delta L}{\delta z_i} = - t_i + \hat{y_i} \sum_{j=1}^K t_j = \hat{y_i} - t_i. $$
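
The final result, $\hat{y_i} - t_i$, can be checked numerically. The following is a minimal sketch that compares it against central finite differences of the loss computed directly from the logits. It assumes softmax outputs, the natural logarithm, and targets that sum to one. The function names, the step size `h`, and the example values are illustrative.

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

def loss_from_logits(z, t):
    # Cross entropy computed via log-softmax for numerical stability.
    shifted = z - np.max(z)
    log_y_hat = shifted - np.log(np.sum(np.exp(shifted)))
    return -np.sum(t * log_y_hat)

def analytic_gradient(z, t):
    # Result of the derivation: dL/dz_i = y_hat_i - t_i
    return softmax(z) - t

def numerical_gradient(z, t, h=1e-6):
    # Central differences, perturbing one logit at a time.
    grad = np.zeros_like(z)
    for i in range(z.size):
        step = np.zeros_like(z)
        step[i] = h
        grad[i] = (loss_from_logits(z + step, t) - loss_from_logits(z - step, t)) / (2 * h)
    return grad

z = np.array([1.5, -0.3, 0.8, 0.0])   # hypothetical logits
t = np.array([0.0, 0.0, 1.0, 0.0])    # one-hot target selecting category 2
grad_analytic = analytic_gradient(z, t)
grad_numeric = numerical_gradient(z, t)
print(np.max(np.abs(grad_analytic - grad_numeric)))
```

The same check should also pass for soft targets such as `t = np.array([0.1, 0.2, 0.6, 0.1])`, because the derivation only relies on the targets summing to one.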