Convolutional Layer

This section details the equations governing a 2D convolutional layer (see bgd.layers.conv2d.Convolutional2D).

If layer \(k\) is a 2D convolutional layer (Conv2D layer), its parameters are the biases (intercepts) \(b^{(k)}\), a 1D real vector, and the filters \(\omega^{(k)}\), a 4D tensor of dimension \((n_F^{(k)}, F_H^{(k)}, F_W^{(k)}, n_C^{(k)})\), respectively the number of filters, the height of the filters, the width of the filters, and the number of channels in the input.
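
As a concrete illustration (the variable names below are not the library's API, just a sketch of the shapes):

```python
import numpy as np

# Hypothetical sizes: 8 filters of height 3 and width 3 over a 4-channel input.
n_filters, f_h, f_w, n_channels = 8, 3, 3, 4

biases = np.zeros(n_filters)                                 # b^(k): 1D vector of length n_F
filters = np.random.randn(n_filters, f_h, f_w, n_channels)   # omega^(k): 4D tensor (n_F, F_H, F_W, n_C)
```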

Strides

The strides, denoted by \(\sigma = [\sigma_1 \; \sigma_2]' \in (\mathbb N^*)^2\), represent the step taken in the input image between the receptive fields of two adjacent pixels of the output image.

They induce the following dimension for the output image:

\[(\ell, H, W, n_C^{(k)}) * (n_F^{(k)}, F_H^{(k)}, F_W^{(k)}, n_C^{(k)}) \mapsto (\ell, \lfloor(H - F_H^{(k)}) / \sigma_1\rfloor + 1, \lfloor(W - F_W^{(k)}) / \sigma_2\rfloor + 1, n_F^{(k)}).\]
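
For instance, a minimal sketch of this shape rule (illustrative only, not the library's code):

```python
def conv_output_shape(batch, h, w, n_f, f_h, f_w, s1, s2):
    """Output shape of a valid (unpadded) convolution with strides (s1, s2) and no dilation."""
    return (batch, (h - f_h) // s1 + 1, (w - f_w) // s2 + 1, n_f)

# A (32, 28, 28, 3) input with 8 filters of size 5x5 and strides (2, 2):
print(conv_output_shape(32, 28, 28, 8, 5, 5, 2, 2))  # (32, 12, 12, 8)
```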

Dilations

The dilations, denoted by \(\delta = [\delta_1 \; \delta_2]' \in (\mathbb N^*)^2\), represent a dilation of the filters, i.e. the number of rows/columns of the input image that are skipped between two consecutive rows/columns of the filters.

The effect of the dilations can be viewed as a literal dilation of the filters where a 2D filter \(\omega\) of shape \((m, n)\) is dilated into another 2D filter \(\hat \omega\) of shape \((\hat m, \hat n) = (\delta_1 \cdot (m-1) + 1, \delta_2 \cdot (n-1) + 1)\).

Therefore, using the notation \(\hat{\cdot}\) to denote the dimensions of a dilated filter, the dimension of the output image is:

\[(\ell, \lfloor(H - \hat F_H^{(k)}) / \sigma_1\rfloor + 1, \lfloor(W - \hat F_W^{(k)}) / \sigma_2\rfloor + 1, n_F^{(k)}),\]

where \(\hat F_H^{(k)} = 1 + \delta_1(F_H^{(k)} - 1)\) and \(\hat F_W^{(k)} = 1 + \delta_2(F_W^{(k)} - 1)\).
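
The same shape rule with dilations taken into account, again as an illustrative sketch:

```python
def dilated_size(f, d):
    """Effective size of a filter dimension of size f under dilation d."""
    return d * (f - 1) + 1

def conv_output_shape_dilated(batch, h, w, n_f, f_h, f_w, s1, s2, d1, d2):
    fh_hat, fw_hat = dilated_size(f_h, d1), dilated_size(f_w, d2)
    return (batch, (h - fh_hat) // s1 + 1, (w - fw_hat) // s2 + 1, n_f)

# Same (32, 28, 28, 3) input, 5x5 filters, strides (2, 2), dilations (2, 2):
print(conv_output_shape_dilated(32, 28, 28, 8, 5, 5, 2, 2, 2, 2))  # (32, 10, 10, 8)
```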

Convolution function

The layer function is then given by:

\[\Lambda^{(k)}_{\theta^{(k)}} : \mathbb R^{\ell \times H \times W \times n_C^{(k)}} \to \mathbb R^{\ell \times (\lfloor(H - \hat F_H^{(k)}) / \sigma_1\rfloor+1) \times (\lfloor(W - \hat F_W^{(k)}) / \sigma_2\rfloor+1) \times n_F^{(k)}} : X \mapsto \Lambda^{(k)}_{\theta^{(k)}}(X; \sigma, \delta),\]

where, for \(\beta\) a multi-index of the output image:

\[\Lambda^{(k)}_{\theta^{(k)}}(X^{(k-1)}; \sigma, \delta)_\beta = \sum_{\gamma_1=1}^{F_H^{(k)}}\sum_{\gamma_2=1}^{F_W^{(k)}}\sum_{\gamma_3=1}^{n_C^{(k)}}\omega^{(k)}_{\beta_3,\gamma_1,\gamma_2,\gamma_3}X^{(k-1)}_{\beta_0,\sigma_1\beta_1+\delta_1\gamma_1,\sigma_2\beta_2+\delta_2\gamma_2,\gamma_3} + b^{(k)}_{\beta_3}.\]
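
This formula translates directly into a naive NumPy reference implementation (a sketch for checking the equations, not the library's optimized code; the name conv2d_forward is hypothetical):

```python
import numpy as np

def conv2d_forward(x, w, b, strides=(1, 1), dilations=(1, 1)):
    """out[m, i, j, f] = sum_{p, q, c} w[f, p, q, c] * x[m, i*s1 + p*d1, j*s2 + q*d2, c] + b[f]."""
    s1, s2 = strides
    d1, d2 = dilations
    n_f, f_h, f_w, _ = w.shape
    batch, h, width, _ = x.shape
    out_h = (h - (d1 * (f_h - 1) + 1)) // s1 + 1
    out_w = (width - (d2 * (f_w - 1) + 1)) // s2 + 1
    out = np.zeros((batch, out_h, out_w, n_f))
    for i in range(out_h):
        for j in range(out_w):
            for p in range(f_h):
                for q in range(f_w):
                    # Input pixel (i*s1 + p*d1, j*s2 + q*d2) contributes to output pixel (i, j):
                    # (batch, n_C) @ (n_C, n_F) -> (batch, n_F)
                    out[:, i, j, :] += x[:, i * s1 + p * d1, j * s2 + q * d2, :] @ w[:, p, q, :].T
    return out + b  # the bias broadcasts over the last (filter) axis
```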

Backpropagation

The backpropagation algorithm requires \(\partial_{\omega^{(k)}_\alpha}\mathcal L\), \(\partial_{b^{(k)}_i}\mathcal L\) and \(\partial_{X^{(k-1)}_\alpha}\mathcal L\).

For the weights update:

\[\begin{split}\partial_{\omega^{(k)}_\alpha}X^{(k)}_\beta &= \sum_{\gamma_1}\sum_{\gamma_2}\sum_{\gamma_3}\delta^{\beta_3}_{\alpha_0}\delta^{\gamma_1}_{\alpha_1}\delta^{\gamma_2}_{\alpha_2}\delta^{\gamma_3}_{\alpha_3}X^{(k-1)}_{\beta_0,\sigma_1\beta_1 + \delta_1\gamma_1, \sigma_2\beta_2 + \delta_2\gamma_2,\gamma_3} \\ &= \delta^{\beta_3}_{\alpha_0}X^{(k-1)}_{\beta_0,\sigma_1\beta_1 + \delta_1\alpha_1,\sigma_2\beta_2 + \delta_2\alpha_2,\alpha_3}.\end{split}\]

Therefore:

\[\begin{split}\partial_{\omega^{(k)}_\alpha}\mathcal L &= \sum_{\beta_0,\beta_1,\beta_2,\beta_3}\varepsilon^{(k)}_\beta\partial_{\omega^{(k)}_\alpha}X^{(k)}_\beta = \sum_{\beta_0,\beta_1,\beta_2}\varepsilon^{(k)}_{\beta_0,\beta_1,\beta_2,\alpha_0}X^{(k-1)}_{\beta_0,\sigma_1\beta_1 + \delta_1\alpha_1,\sigma_2\beta_2 + \delta_2\alpha_2,\alpha_3} \\ &= \sum_{\beta_0,\beta_1,\beta_2}\pi_{\tau_{0,3}}\varepsilon^{(k)}_{\alpha_0,\beta_1,\beta_2,\beta_0}\pi_{\tau_{0,3}}X^{(k-1)}_{\alpha_3,\sigma_1\beta_1 + \delta_1\alpha_1,\sigma_2\beta_2 + \delta_2\alpha_2,\beta_0} = \pi_{\tau_{0,3}}\Lambda^{(k)}_{(\pi_{\tau_{0,3}}\varepsilon^{(k)}, \mathbf{0})}(\pi_{\tau_{0,3}}X^{(k-1)}; \delta, \sigma)_\alpha,\end{split}\]

i.e. the gradient with respect to the filters is itself a convolution in which the error signal plays the role of the filters, the strides and dilations are swapped, and the biases are zero.
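
A naive sketch of this filter gradient, following the first line of the derivation (illustrative only, not the library's code):

```python
import numpy as np

def conv2d_grad_weights(x, eps, filter_shape, strides=(1, 1), dilations=(1, 1)):
    """dL/dw[f, p, q, c] = sum_{m, i, j} eps[m, i, j, f] * x[m, i*s1 + p*d1, j*s2 + q*d2, c]."""
    s1, s2 = strides
    d1, d2 = dilations
    n_f, f_h, f_w, n_c = filter_shape
    _, out_h, out_w, _ = eps.shape
    grad_w = np.zeros(filter_shape)
    for p in range(f_h):
        for q in range(f_w):
            for i in range(out_h):
                for j in range(out_w):
                    # (n_F, batch) @ (batch, n_C) -> (n_F, n_C) contribution at filter position (p, q)
                    grad_w[:, p, q, :] += eps[:, i, j, :].T @ x[:, i * s1 + p * d1, j * s2 + q * d2, :]
    return grad_w
```

Assuming the conv2d_forward sketch above and input spatial dimensions that divide exactly (no discarded rows or columns), the same tensor should be obtained as conv2d_forward(x.transpose(3, 1, 2, 0), eps.transpose(3, 1, 2, 0), np.zeros(n_f), strides=dilations, dilations=strides).transpose(3, 1, 2, 0), which is the swapped-strides/dilations convolution stated above.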

For the biases, the derivation is simpler:

\[\partial_{b^{(k)}_i}\mathcal L = \sum_{\beta}\varepsilon^{(k)}_\beta\partial_{b^{(k)}_i}X^{(k)}_\beta = \sum_{\beta}\varepsilon^{(k)}_\beta\delta_i^{\beta_3} = \sum_{\beta_0,\beta_1,\beta_2}\varepsilon^{(k)}_{\beta_0,\beta_1,\beta_2,i}.\]
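
Assuming eps holds the backpropagated error \(\varepsilon^{(k)}\) with shape (batch, out_h, out_w, n_F), this is a single reduction in NumPy:

```python
# Sum the error signal over the batch and the two spatial axes; result has shape (n_F,).
grad_b = eps.sum(axis=(0, 1, 2))
```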

Finally, for the propagation of the error signal to the previous layer, the expression is less straightforward:

\[\partial_{X^{(k-1)}_\alpha}\mathcal L = \sum_{\beta}\varepsilon^{(k)}_\beta\partial_{X^{(k-1)}_\alpha}X^{(k)}_\beta = \sum_{\beta_1}\sum_{\beta_2}\sum_{\beta_3}\sum_{\gamma_1}\sum_{\gamma_2}\varepsilon^{(k)}_{\alpha_0,\beta_1,\beta_2,\beta_3}\omega^{(k)}_{\beta_3,\gamma_1,\gamma_2,\alpha_3}\delta_{\sigma_1\beta_1+\delta_1\gamma_1}^{\alpha_1}\delta_{\sigma_2\beta_2+\delta_2\gamma_2}^{\alpha_2}.\]
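
A naive sketch of this propagation (illustrative only): each output pixel scatters its error back over the input pixels of its receptive field, weighted by the filters.

```python
import numpy as np

def conv2d_grad_input(eps, w, input_shape, strides=(1, 1), dilations=(1, 1)):
    """dL/dx[m, a, b, c] = sum_{i, j, f, p, q} eps[m, i, j, f] * w[f, p, q, c],
    restricted to a = i*s1 + p*d1 and b = j*s2 + q*d2 (the Kronecker deltas above)."""
    s1, s2 = strides
    d1, d2 = dilations
    n_f, f_h, f_w, n_c = w.shape
    _, out_h, out_w, _ = eps.shape
    grad_x = np.zeros(input_shape)
    for i in range(out_h):
        for j in range(out_w):
            for p in range(f_h):
                for q in range(f_w):
                    # (batch, n_F) @ (n_F, n_C) -> (batch, n_C), scattered back onto the input
                    grad_x[:, i * s1 + p * d1, j * s2 + q * d2, :] += eps[:, i, j, :] @ w[:, p, q, :]
    return grad_x
```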