Convolutional Layer
^^^^^^^^^^^^^^^^^^^

This section details the equations ruling a (2D) convolutional layer (see
:class:`bgd.layers.conv2d.Convolutional2D`).

If layer :math:`k` is a 2D-convolutional layer (Conv2D layer), the layer parameters are
the biases (intercepts) :math:`b^{(k)}`, a 1D real vector, and the filters
:math:`\omega^{(k)}`, a 4D tensor of dimension
:math:`(n_F^{(k)}, F_H^{(k)}, F_W^{(k)}, n_C^{(k)})`, respectively the number of
filters, the height of the filters, the width of the filters, and the number of
channels in the input.

Strides
"""""""

The strides, denoted by :math:`\sigma = [\sigma_1 \; \sigma_2]' \in (\mathbb{N}^*)^2`,
represent the step taken between two pixels of the output image. They induce the
following dimension for the output image:

.. math::
    (\ell, H, W, n_C^{(k)}) * (n_F^{(k)}, F_H^{(k)}, F_W^{(k)}, n_C^{(k)})
    \mapsto (\ell, \lfloor(H - F_H^{(k)}) / \sigma_1\rfloor + 1,
    \lfloor(W - F_W^{(k)}) / \sigma_2\rfloor + 1, n_F^{(k)}).

Dilations
"""""""""

The dilations, denoted by :math:`\delta = [\delta_1 \; \delta_2]' \in (\mathbb{N}^*)^2`,
represent a dilation of the filters, i.e. the number of rows/columns of pixels of the
input image that are skipped between two rows/columns of the filters. The effect of the
dilations can be viewed as a *literal* dilation of the filters, where a 2D filter
:math:`\omega` of shape :math:`(m, n)` is dilated into another 2D filter
:math:`\hat \omega` of shape
:math:`(\hat m, \hat n) = (\delta_1 (m-1) + 1, \delta_2 (n-1) + 1)`. Therefore, using
the notation :math:`\hat{\cdot}` to denote the shape of a dilated filter, the dimension
of the output image is:

.. math::
    (\ell, \lfloor(H - \hat F_H^{(k)}) / \sigma_1\rfloor + 1,
    \lfloor(W - \hat F_W^{(k)}) / \sigma_2\rfloor + 1, n_F^{(k)}),

where :math:`\hat F_H^{(k)} = 1 + \delta_1(F_H^{(k)} - 1)` and
:math:`\hat F_W^{(k)} = 1 + \delta_2(F_W^{(k)} - 1)`.

Convolution function
""""""""""""""""""""

The layer function is then given by:

.. math::
    \Lambda^{(k)}_{\theta^{(k)}} :
    \mathbb R^{\ell \times H \times W \times n_C^{(k)}} \to
    \mathbb R^{\ell \times (\lfloor(H - \hat F_H^{(k)}) / \sigma_1\rfloor+1)
    \times (\lfloor(W - \hat F_W^{(k)}) / \sigma_2\rfloor+1) \times n_F^{(k)}} :
    X \mapsto \Lambda^{(k)}_{\theta^{(k)}}(X; \sigma, \delta),

where, for :math:`\beta` a multi-index of the output image:

.. math::
    \Lambda^{(k)}_{\theta^{(k)}}(X^{(k-1)}; \sigma, \delta)_\beta =
    \sum_{\gamma_1=1}^{F_H^{(k)}}\sum_{\gamma_2=1}^{F_W^{(k)}}\sum_{\gamma_3=1}^{n_C^{(k)}}
    \omega^{(k)}_{\beta_3,\gamma_1,\gamma_2,\gamma_3}
    X^{(k-1)}_{\beta_0,\sigma_1\beta_1+\delta_1\gamma_1,\sigma_2\beta_2+\delta_2\gamma_2,\gamma_3}
    + b^{(k)}_{\beta_3}.
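To make the shape formulas of the *Strides* and *Dilations* subsections concrete, here
is a minimal sketch of the output-shape computation. The function name
``conv2d_output_shape`` is hypothetical and not part of :mod:`bgd`; it only transcribes
the formulas above:

.. code-block:: python

    def conv2d_output_shape(input_shape, filters_shape, strides=(1, 1), dilations=(1, 1)):
        """Output shape of a valid 2D convolution (l, H, W, n_C) * (n_F, F_H, F_W, n_C)."""
        l, H, W, n_C = input_shape
        n_F, F_H, F_W, n_C_f = filters_shape
        assert n_C == n_C_f, "input channels must match filter channels"
        # Dilated filter sizes: hat(F) = 1 + delta * (F - 1).
        F_H_hat = 1 + dilations[0] * (F_H - 1)
        F_W_hat = 1 + dilations[1] * (F_W - 1)
        # Spatial sizes: floor((H - hat(F_H)) / sigma_1) + 1, and similarly for W.
        return (l,
                (H - F_H_hat) // strides[0] + 1,
                (W - F_W_hat) // strides[1] + 1,
                n_F)

    # A (8, 32, 32, 3) batch with 16 filters of spatial shape (5, 5),
    # strides (2, 2) and dilations (2, 2) gives (8, 12, 12, 16).
    print(conv2d_output_shape((8, 32, 32, 3), (16, 5, 5, 3), (2, 2), (2, 2)))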
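The layer function itself can be transcribed almost literally. The following NumPy
sketch (hypothetical and unoptimized, with zero-based indices instead of the one-based
:math:`\gamma` above; not the actual :mod:`bgd` implementation) loops over the output
pixels and contracts each dilated receptive field against the filters:

.. code-block:: python

    import numpy as np

    def conv2d_forward(X, w, b, strides=(1, 1), dilations=(1, 1)):
        """Naive transcription of Lambda: X is (l, H, W, n_C),
        w is (n_F, F_H, F_W, n_C), b is (n_F,)."""
        s1, s2 = strides
        d1, d2 = dilations
        n_F, F_H, F_W, n_C = w.shape
        l, H, W, _ = X.shape
        out_H = (H - (1 + d1 * (F_H - 1))) // s1 + 1
        out_W = (W - (1 + d2 * (F_W - 1))) // s2 + 1
        Y = np.empty((l, out_H, out_W, n_F))
        for b1 in range(out_H):
            for b2 in range(out_W):
                # Dilated receptive field anchored at (s1*b1, s2*b2).
                patch = X[:, s1*b1 : s1*b1 + d1*(F_H - 1) + 1 : d1,
                             s2*b2 : s2*b2 + d2*(F_W - 1) + 1 : d2, :]
                # Contract over (F_H, F_W, n_C) against every filter, add biases.
                Y[:, b1, b2, :] = np.tensordot(patch, w, axes=([1, 2, 3], [1, 2, 3])) + b
        return Y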
Backpropagation
"""""""""""""""

The backpropagation algorithm requires :math:`\partial_{\omega^{(k)}_\alpha}\mathcal L`,
:math:`\partial_{b^{(k)}_i}\mathcal L` and :math:`\partial_{X^{(k-1)}_\alpha}\mathcal L`.
Below, :math:`\delta^i_j` denotes the Kronecker delta, not to be confused with the
dilations :math:`\delta_1, \delta_2`. For the weights update:

.. math::
    \partial_{\omega^{(k)}_\alpha}X^{(k)}_\beta
    &= \sum_{\gamma_1}\sum_{\gamma_2}\sum_{\gamma_3}
    \delta^{\beta_3}_{\alpha_0}\delta^{\gamma_1}_{\alpha_1}
    \delta^{\gamma_2}_{\alpha_2}\delta^{\gamma_3}_{\alpha_3}
    X^{(k-1)}_{\beta_0,\sigma_1\beta_1 + \delta_1\gamma_1,
    \sigma_2\beta_2 + \delta_2\gamma_2,\gamma_3} \\
    &= \delta^{\beta_3}_{\alpha_0}
    X^{(k-1)}_{\beta_0,\sigma_1\beta_1 + \delta_1\alpha_1,
    \sigma_2\beta_2 + \delta_2\alpha_2,\alpha_3}.

Therefore:

.. math::
    \partial_{\omega^{(k)}_\alpha}\mathcal L
    &= \sum_{\beta_0,\beta_1,\beta_2,\beta_3}\varepsilon^{(k)}_\beta
    \partial_{\omega^{(k)}_\alpha}X^{(k)}_\beta
    = \sum_{\beta_0,\beta_1,\beta_2}
    \varepsilon^{(k)}_{\beta_0,\beta_1,\beta_2,\alpha_0}
    X^{(k-1)}_{\beta_0,\sigma_1\beta_1 + \delta_1\alpha_1,
    \sigma_2\beta_2 + \delta_2\alpha_2,\alpha_3} \\
    &= \sum_{\beta_0,\beta_1,\beta_2}
    \pi_{\tau_{0,3}}\varepsilon^{(k)}_{\alpha_0,\beta_1,\beta_2,\beta_0}
    \pi_{\tau_{0,3}}X^{(k-1)}_{\alpha_3,\sigma_1\beta_1 + \delta_1\alpha_1,
    \sigma_2\beta_2 + \delta_2\alpha_2,\beta_0}
    = \pi_{\tau_{0,3}}\Lambda^{(k)}_{(\pi_{\tau_{0,3}}\varepsilon^{(k)},
    \mathbf{0})}(\pi_{\tau_{0,3}}X^{(k-1)}; \delta, \sigma)_\alpha,

where :math:`\pi_{\tau_{0,3}}` denotes the transposition of axes 0 and 3. In other
words, the backward pass with respect to the weights is itself a convolution in which
strides and dilations have been swapped and the biases are zero.

For the biases, the derivation is simpler:

.. math::
    \partial_{b^{(k)}_i}\mathcal L
    = \sum_{\beta}\varepsilon^{(k)}_\beta\partial_{b^{(k)}_i}X^{(k)}_\beta
    = \sum_{\beta}\varepsilon^{(k)}_\beta\delta_i^{\beta_3}
    = \sum_{\beta_0,\beta_1,\beta_2}\varepsilon^{(k)}_{\beta_0,\beta_1,\beta_2,i}.

Finally, for the signal propagation, the formula is less straightforward:

.. math::
    \partial_{X^{(k-1)}_\alpha}\mathcal L
    = \sum_{\beta}\varepsilon^{(k)}_\beta\partial_{X^{(k-1)}_\alpha}X^{(k)}_\beta
    = \sum_{\beta_1}\sum_{\beta_2}\sum_{\beta_3}\sum_{\gamma_1}\sum_{\gamma_2}
    \varepsilon^{(k)}_{\alpha_0,\beta_1,\beta_2,\beta_3}
    \omega^{(k)}_{\beta_3,\gamma_1,\gamma_2,\alpha_3}
    \delta_{\sigma_1\beta_1+\delta_1\gamma_1}^{\alpha_1}
    \delta_{\sigma_2\beta_2+\delta_2\gamma_2}^{\alpha_2}.

The Kronecker deltas keep only the pairs :math:`(\beta_1, \gamma_1)` and
:math:`(\beta_2, \gamma_2)` such that
:math:`\sigma_1\beta_1 + \delta_1\gamma_1 = \alpha_1` and
:math:`\sigma_2\beta_2 + \delta_2\gamma_2 = \alpha_2`, i.e. only the output pixels
whose receptive field contained the input pixel :math:`\alpha` in the forward pass.
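Continuing the hypothetical ``conv2d_forward`` sketch from above, the two parameter
gradients can be written down directly: the bias gradient is a plain sum of
:math:`\varepsilon^{(k)}` over the batch and spatial axes, and the weight gradient
reuses the forward convolution with strides and dilations swapped, zero biases, and the
axis-(0, 3) transpositions :math:`\pi_{\tau_{0,3}}`. This sketch assumes the floored
divisions in the forward output shape are exact (otherwise the backward convolution
would need trimming):

.. code-block:: python

    def conv2d_backward_params(X, eps, strides=(1, 1), dilations=(1, 1)):
        """Gradients w.r.t. biases and weights; eps = dL/dY has shape (l, out_H, out_W, n_F)."""
        # dL/db_i: sum of eps over beta_0, beta_1, beta_2.
        grad_b = eps.sum(axis=(0, 1, 2))
        # dL/dw = pi . Lambda_(pi eps, 0)(pi X; delta, sigma), pi swapping axes 0 and 3.
        grad_w = conv2d_forward(
            X.transpose(3, 1, 2, 0),      # pi X^{(k-1)}: (n_C, H, W, l)
            eps.transpose(3, 1, 2, 0),    # pi eps used as the filters: (n_F, out_H, out_W, l)
            np.zeros(eps.shape[3]),       # no biases in the backward convolution
            strides=dilations,            # strides and dilations swapped
            dilations=strides,
        ).transpose(3, 1, 2, 0)           # back to (n_F, F_H, F_W, n_C)
        return grad_b, grad_w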
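In practice, the signal-propagation formula amounts to a scatter-add: each output pixel
distributes :math:`\varepsilon^{(k)}` weighted by :math:`\omega^{(k)}` back onto its
dilated receptive field, which is exactly what the two Kronecker deltas encode. A
minimal sketch under the same assumptions as above:

.. code-block:: python

    def conv2d_backward_input(eps, w, input_shape, strides=(1, 1), dilations=(1, 1)):
        """Gradient w.r.t. the layer input, i.e. the signal sent back to layer k-1."""
        s1, s2 = strides
        d1, d2 = dilations
        n_F, F_H, F_W, n_C = w.shape
        l, out_H, out_W, _ = eps.shape
        grad_X = np.zeros(input_shape)
        for b1 in range(out_H):
            for b2 in range(out_W):
                # (l, n_F) x (n_F, F_H, F_W, n_C) -> (l, F_H, F_W, n_C).
                contrib = np.tensordot(eps[:, b1, b2, :], w, axes=([1], [0]))
                # Scatter onto the positions selected by the Kronecker deltas:
                # alpha_1 = s1*b1 + d1*g1, alpha_2 = s2*b2 + d2*g2.
                grad_X[:, s1*b1 : s1*b1 + d1*(F_H - 1) + 1 : d1,
                          s2*b2 : s2*b2 + d2*(F_W - 1) + 1 : d2, :] += contrib
        return grad_X

Overlapping receptive fields (stride smaller than the dilated filter size) accumulate
through the ``+=``, matching the sum over :math:`\beta` in the formula above.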