Convolutional Layer
This section details the equations governing a 2D convolutional layer
(see bgd.layers.conv2d.Convolutional2D).
If layer \(k\) is a 2D-convolutional layer (Conv2D layer), the layer
parameters are the biases (intercepts) \(b^{(k)}\), a 1D real vector,
and the filters \(\omega^{(k)}\), a 4D tensor of dimension
\((n_F^{(k)}, F_H^{(k)}, F_W^{(k)}, n_C^{(k)})\), respectively the number
of filters, the height of the filters, the width of the filters, and the number
of channels in the input.
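As an illustration, here is a minimal NumPy sketch of these parameter shapes; the variable names (n_filters, F_H, F_W, n_channels) are purely illustrative and are not part of the library's API.

```python
import numpy as np

# Illustrative sizes: 8 filters of height 3 and width 3 over a 3-channel input.
n_filters, F_H, F_W, n_channels = 8, 3, 3, 3

rng = np.random.default_rng(0)
w = rng.standard_normal((n_filters, F_H, F_W, n_channels))  # filters omega^{(k)}
b = np.zeros(n_filters)                                     # biases b^{(k)}
```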
Strides
The strides, denoted by \(\sigma = [\sigma_1 \; \sigma_2]' \in (\mathbb N^*)^2\),
represent the step taken in the input image between two consecutive pixels of the output image.
They induce the following dimension for the output image:
\[(\ell, H, W, n_C^{(k)}) * (n_F^{(k)}, F_H^{(k)}, F_W^{(k)}, n_C^{(k)}) \mapsto
(\ell, \lfloor(H - F_H^{(k)}) / \sigma_1\rfloor + 1, \lfloor(W - F_W^{(k)}) / \sigma_2\rfloor + 1, n_F^{(k)}).\]
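Assuming a valid convolution (no padding), this output size can be computed as in the following sketch; the function name is illustrative.

```python
def strided_output_shape(H, W, F_H, F_W, s1, s2):
    """Spatial output size of a valid convolution with strides (s1, s2)."""
    return (H - F_H) // s1 + 1, (W - F_W) // s2 + 1
```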
Dilations
The dilations, denoted by \(\delta = [\delta_1 \; \delta_2]' \in (\mathbb N^*)^2\),
represent a dilation on the filters, i.e. the number of rows/columns of pixels in the input
image that are skipped between two rows/columns of the filters.
The effect of the dilations can be viewed as a literal dilation of the filters where a
2D filter \(\omega\) of shape \((m, n)\) is dilated into another 2D filter
\(\hat \omega\) of shape \((\hat m, \hat n) = (\delta_1 \cdot (m-1) + 1, \delta_2 \cdot (n-1) + 1)\).
Therefore, using the notation \(\hat{\cdot}\) to denote the shape of a
dilated filter, the dimension of the output image is:
\[(\ell, \lfloor(H - \hat F_H^{(k)}) / \sigma_1\rfloor + 1, \lfloor(W - \hat F_W^{(k)}) / \sigma_2\rfloor + 1, n_F^{(k)}),\]
where \(\hat F_H^{(k)} = 1 + \delta_1(F_H^{(k)} - 1)\) and \(\hat F_W^{(k)} = 1 + \delta_2(F_W^{(k)} - 1)\).
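The same shape computation, now accounting for the dilated filter sizes \(\hat F_H^{(k)}\) and \(\hat F_W^{(k)}\), could look like the following sketch (illustrative names, no padding).

```python
def dilated_output_shape(H, W, F_H, F_W, s1, s2, d1, d2):
    """Spatial output size once the filter is dilated by (d1, d2)."""
    F_H_hat = d1 * (F_H - 1) + 1  # effective filter height
    F_W_hat = d2 * (F_W - 1) + 1  # effective filter width
    return (H - F_H_hat) // s1 + 1, (W - F_W_hat) // s2 + 1
```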
Convolution function
The layer function is then given by:
\[\Lambda^{(k)}_{\theta^{(k)}} :
\mathbb R^{\ell \times H \times W \times n_C^{(k)}}
\to \mathbb R^{\ell \times (\lfloor(H - \hat F_H^{(k)}) / \sigma_1\rfloor+1) \times (\lfloor(W - \hat F_W^{(k)}) / \sigma_2\rfloor+1) \times n_F^{(k)}} :
X \mapsto \Lambda^{(k)}_{\theta^{(k)}}(X; \sigma, \delta),\]
where, for \(\beta\) a multi-index of the output image:
\[\Lambda^{(k)}_{\theta^{(k)}}(X^{(k-1)}; \sigma, \delta)_\beta = \sum_{\gamma_1=1}^{F_H^{(k)}}\sum_{\gamma_2=1}^{F_W^{(k)}}\sum_{\gamma_3=1}^{n_C^{(k)}}\omega^{(k)}_{\beta_3,\gamma_1,\gamma_2,\gamma_3}X^{(k-1)}_{\beta_0,\sigma_1\beta_1+\delta_1\gamma_1,\sigma_2\beta_2+\delta_2\gamma_2,\gamma_3} + b^{(k)}_{\beta_3}.\]
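For reference, here is a naive NumPy transcription of this formula with 0-based indices and no padding; it is a sketch meant to mirror the equation above, not the library's actual implementation.

```python
import numpy as np

def conv2d_forward(X, w, b, strides=(1, 1), dilations=(1, 1)):
    """Naive forward pass: direct transcription of the convolution formula."""
    s1, s2 = strides
    d1, d2 = dilations
    n_F, F_H, F_W, n_C = w.shape
    batch, H, W, _ = X.shape
    out_H = (H - (d1 * (F_H - 1) + 1)) // s1 + 1
    out_W = (W - (d2 * (F_W - 1) + 1)) // s2 + 1
    out = np.zeros((batch, out_H, out_W, n_F))
    for b1 in range(out_H):
        for b2 in range(out_W):
            # Input patch read by the (dilated) filter at output position (b1, b2).
            patch = X[:, s1 * b1 : s1 * b1 + d1 * (F_H - 1) + 1 : d1,
                         s2 * b2 : s2 * b2 + d2 * (F_W - 1) + 1 : d2, :]
            # Sum over gamma_1, gamma_2, gamma_3 for all filters beta_3 at once.
            out[:, b1, b2, :] = np.tensordot(patch, w, axes=([1, 2, 3], [1, 2, 3]))
    return out + b
```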
Backpropagation
The backpropagation algorithm requires \(\partial_{\omega^{(k)}_\alpha}\mathcal L\),
\(\partial_{b^{(k)}_i}\mathcal L\) and \(\partial_{X^{(k-1)}_\alpha}\mathcal L\).
For the weights update:
\[\begin{split}\partial_{\omega^{(k)}_\alpha}X^{(k)}_\beta &= \sum_{\gamma_1}\sum_{\gamma_2}\sum_{\gamma_3}\delta^{\beta_3}_{\alpha_0}\delta^{\gamma_1}_{\alpha_1}\delta^{\gamma_2}_{\alpha_2}\delta^{\gamma_3}_{\alpha_3}X^{(k-1)}_{\beta_0,\sigma_1\beta_1 + \delta_1\gamma_1, \sigma_2\beta_2 + \delta_2\gamma_2,\gamma_3} \\
&= \delta^{\beta_3}_{\alpha_0}X^{(k-1)}_{\beta_0,\sigma_1\beta_1 + \delta_1\alpha_1,\sigma_2\beta_2 + \delta_2\alpha_2,\alpha_3}.\end{split}\]
Therefore:
\[\begin{split}\partial_{\omega^{(k)}_\alpha}\mathcal L &= \sum_{\beta_0,\beta_1,\beta_2,\beta_3}\varepsilon^{(k)}_\beta\partial_{\omega^{(k)}_\alpha}X^{(k)}_\beta
= \sum_{\beta_0,\beta_1,\beta_2}\varepsilon^{(k)}_{\beta_0,\beta_1,\beta_2,\alpha_0}X^{(k-1)}_{\beta_0,\sigma_1\beta_1 + \delta_1\alpha_1,\sigma_2\beta_2 + \delta_2\alpha_2,\alpha_3} \\
&= \sum_{\beta_0,\beta_1,\beta_2}\pi_{\tau_{0,3}}\varepsilon^{(k)}_{\alpha_0,\beta_1,\beta_2,\beta_0}\pi_{\tau_{0,3}}X^{(k-1)}_{\alpha_3,\sigma_1\beta_1 + \delta_1\alpha_1,\sigma_2\beta_2 + \delta_2\alpha_2,\beta_0}
= \pi_{\tau_{0,3}}\Lambda^{(k)}_{(\pi_{\tau_{0,3}}\varepsilon^{(k)}, \mathbf{0})}(\pi_{\tau_{0,3}}X^{(k-1)}; \delta, \sigma)_\alpha,\end{split}\]
i.e. the gradient with respect to the weights is itself a convolution, without biases, in which the roles of the strides and the dilations are swapped.
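The following naive sketch transcribes this sum directly (0-based indices, illustrative names, in the spirit of the forward-pass sketch above). Note how, inside the slice, the strides and dilations indeed swap roles: the patch starts at a multiple of the dilation and advances by steps of the stride.

```python
import numpy as np

def conv2d_grad_w(X, eps, w_shape, strides=(1, 1), dilations=(1, 1)):
    """Naive weight gradient; eps is the error signal epsilon^{(k)}."""
    s1, s2 = strides
    d1, d2 = dilations
    n_F, F_H, F_W, n_C = w_shape
    _, out_H, out_W, _ = eps.shape
    grad_w = np.zeros(w_shape)
    for a1 in range(F_H):
        for a2 in range(F_W):
            # Input pixels seen by filter entry (a1, a2) across all output positions.
            patch = X[:, d1 * a1 : d1 * a1 + s1 * (out_H - 1) + 1 : s1,
                         d2 * a2 : d2 * a2 + s2 * (out_W - 1) + 1 : s2, :]
            # Contract over the batch and the two spatial output axes.
            grad_w[:, a1, a2, :] = np.tensordot(eps, patch, axes=([0, 1, 2], [0, 1, 2]))
    return grad_w
```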
For the biases, the computation is simpler:
\[\partial_{b^{(k)}_i}\mathcal L = \sum_{\beta}\varepsilon^{(k)}_\beta\partial_{b^{(k)}_i}X^{(k)}_\beta
= \sum_{\beta}\varepsilon^{(k)}_\beta\delta_i^{\beta_3} = \sum_{\beta_0,\beta_1,\beta_2}\varepsilon^{(k)}_{\beta_0,\beta_1,\beta_2,i}.\]
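In code, this reduces to summing the error signal over the batch and the two spatial axes, for instance:

```python
def conv2d_grad_b(eps):
    """Bias gradient: sum of epsilon^{(k)} over the batch and spatial axes."""
    return eps.sum(axis=(0, 1, 2))
```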
Finally, for the propagation of the error signal to the previous layer, the formula is less straightforward:
\[\partial_{X^{(k-1)}_\alpha}\mathcal L = \sum_{\beta}\varepsilon^{(k)}_\beta\partial_{X^{(k-1)}_\alpha}X^{(k)}_\beta
= \sum_{\beta_1}\sum_{\beta_2}\sum_{\beta_3}\sum_{\gamma_1}\sum_{\gamma_2}\varepsilon^{(k)}_{\alpha_0,\beta_1,\beta_2,\beta_3}\omega^{(k)}_{\beta_3,\gamma_1,\gamma_2,\alpha_3}\delta_{\sigma_1\beta_1+\delta_1\gamma_1}^{\alpha_1}\delta_{\sigma_2\beta_2+\delta_2\gamma_2}^{\alpha_2}.\]
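A naive transcription scatters each error term back onto the input pixels it was computed from; the sketch below (0-based indices, no padding, illustrative names) follows the formula above directly.

```python
import numpy as np

def conv2d_grad_x(eps, w, x_shape, strides=(1, 1), dilations=(1, 1)):
    """Naive input gradient: scatter epsilon^{(k)} back through the filters."""
    s1, s2 = strides
    d1, d2 = dilations
    n_F, F_H, F_W, n_C = w.shape
    _, out_H, out_W, _ = eps.shape
    grad_x = np.zeros(x_shape)
    for b1 in range(out_H):
        for b2 in range(out_W):
            # Contribution of output position (b1, b2) to every input pixel it reads.
            contrib = np.tensordot(eps[:, b1, b2, :], w, axes=([1], [0]))
            grad_x[:, s1 * b1 : s1 * b1 + d1 * (F_H - 1) + 1 : d1,
                      s2 * b2 : s2 * b2 + d2 * (F_W - 1) + 1 : d2, :] += contrib
    return grad_x
```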