Introduction and Notations
--------------------------

Dataset
"""""""

From here on, we assume that :math:`\mathcal D = \{(x^{(i)}, y^{(i)})\}_{i=1}^N` is the dataset, such that:

.. math::

    \forall i \in \{1, \ldots, N\} : x^{(i)} \in \mathbb R^n \text{ and } y^{(i)} \in \mathbb R^m.

Let :math:`\mathcal N` denote the neural net function:

.. math::

    \mathcal N : \mathbb R^n \to \mathbb R^m : x \mapsto \mathcal N(x).

For :math:`X \in \mathbb R^{\ell \times n}` a matrix of samples (each row is a sample in :math:`\mathcal D`), we naturally extend :math:`\mathcal N` so that:

.. math::

    \mathcal N(X) = [\mathcal N(X_i)]_{i=1}^\ell \in \mathbb R^{\ell \times m}.

Neural Stack
""""""""""""

A linear neural network (LNN for short, also called a neural stack, see :class:`bgd.nn.NeuralStack`) is the composition of a list of layers. Let :math:`K \in \mathbb N^*` denote the number of layers. For :math:`k \in \{1, \ldots, K\}`, the :math:`k`\ th layer is denoted by :math:`\Lambda^{(k)}` such that:

.. math::

    \Lambda^{(k)}_{\theta^{(k)}} : \mathbb R^{n^{(k-1)}} \to \mathbb R^{n^{(k)}},

where :math:`\theta^{(k)}` denotes the parameters of layer :math:`k` and is a tensor of rank :math:`r^{(k)} \in \mathbb N` (0 for a non-parametrized layer) and of shape :math:`\delta^{(k)} \in (\mathbb N^*)^{r^{(k)}}`, such that:

.. math::

    \theta^{(k)} \in \mathbb R^{\delta^{(k)}} := \mathbb R^{\prod_{i=1}^{r^{(k)}} \delta^{(k)}_i}.

Here, :math:`n^{(j)}` denotes the dimension of the (tensor) output of the :math:`j`\ th layer. For the sake of notation, we also introduce :math:`n^{(0)} = n`.

The neural net is therefore defined as the composition of all the layers:

.. math::

    \mathcal N(X) := \Big(\bigcirc_{k=1}^K \Lambda_{\theta^{(k)}}^{(k)}\Big)(X),

where the composition must be read :math:`\Lambda^{(K)}_{\theta^{(K)}} \circ \ldots \circ \Lambda^{(1)}_{\theta^{(1)}}` so that dimensions stay consistent.

Miscellaneous Notations
"""""""""""""""""""""""

We also write :math:`X \in \mathbb R^{\ell \times n}` for a batch input of the LNN. For the sake of convenience, for :math:`k \in \{1, \ldots, K\}`, we write :math:`X^{(k)} := \Lambda_{\theta^{(k)}}^{(k)}(X^{(k-1)})` with :math:`X^{(0)} := X`.

For :math:`X` a rank-:math:`n` tensor and :math:`\sigma \in \mathfrak S_n` a permutation on :math:`n` elements, we introduce the following notation:

.. math::

    \pi_\sigma X := \pi_\sigma(X) := [X_{\sigma\alpha}]_\alpha.

This extends matrix transposition: if :math:`\sigma = \tau_{i,j}` is the transposition of :math:`i` and :math:`j`, then :math:`\pi_\sigma X` is still a rank-:math:`n` tensor, but with indices :math:`i` and :math:`j` swapped.

Cost Function
"""""""""""""

A cost function (or *loss function*, the two terms being equivalent) is a function:

.. math::

    L : \mathbb R^{m} \times \mathbb R^m \to \mathbb R^+ : (y, \hat y) \mapsto L(y, \hat y)

that measures the error between the ground truth :math:`y` and the estimate :math:`\hat y = \mathcal N(x)`. The training phase attempts to minimize the loss :math:`L` by adapting the parameters of each layer. We introduce the notation:

.. math::

    \mathcal L : \mathbb R^{\ell \times n} \to \mathbb R^+ : X \mapsto L(\mathbf y, \mathcal N(X))

for the loss computed on samples of the dataset :math:`\mathcal D`, where :math:`\mathbf y` is the label vector associated with the samples :math:`X` (i.e. :math:`\mathbf y = [y^{(i_j)}]_{j=1}^\ell` for indices :math:`i_j \in \{1, \ldots, N\}` such that :math:`X = [x^{(i_j)}]_{j=1}^\ell`).

The list of loss functions implemented in BGD can be found in the :mod:`bgd.cost` module.
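
To make these notations concrete, here is a minimal NumPy sketch of a neural stack and of the batch loss :math:`\mathcal L(X) = L(\mathbf y, \mathcal N(X))`. It only illustrates the formulas above and is *not* the :class:`bgd.nn.NeuralStack` API: the helper names (``dense_layer``, ``relu_layer``, ``mse_loss``, ``neural_stack``) are hypothetical.

.. code-block:: python

    import numpy as np

    def dense_layer(theta):
        """Affine layer parametrized by theta = (W, b); rank r = 2 for W."""
        W, b = theta
        return lambda X: X @ W + b

    def relu_layer():
        """Non-parametrized layer (r = 0)."""
        return lambda X: np.maximum(X, 0.0)

    def neural_stack(layers):
        """N(X) := (Lambda^(K) o ... o Lambda^(1))(X), applied row-wise to a batch."""
        def N(X):
            for layer in layers:      # X^(k) := Lambda^(k)(X^(k-1)), with X^(0) := X
                X = layer(X)
            return X
        return N

    def mse_loss(y, y_hat):
        """An example of L : R^m x R^m -> R^+, here the mean squared error."""
        return float(np.mean((y - y_hat) ** 2))

    rng = np.random.default_rng(0)
    n, m, ell = 4, 2, 8                                       # input dim, output dim, batch size
    layers = [
        dense_layer((rng.normal(size=(n, 3)), np.zeros(3))),  # R^4 -> R^3
        relu_layer(),                                         # R^3 -> R^3
        dense_layer((rng.normal(size=(3, m)), np.zeros(m))),  # R^3 -> R^2
    ]
    N = neural_stack(layers)

    X = rng.normal(size=(ell, n))     # batch of samples, one per row
    y = rng.normal(size=(ell, m))     # associated labels, i.e. the vector \mathbf y
    print(N(X).shape)                 # (8, 2), i.e. an element of R^{l x m}
    print(mse_loss(y, N(X)))          # scalar value of the batch loss \mathcal L(X)

The loop inside ``neural_stack`` is exactly the composition :math:`\Lambda^{(K)}_{\theta^{(K)}} \circ \ldots \circ \Lambda^{(1)}_{\theta^{(1)}}`: each iteration maps :math:`X^{(k-1)}` to :math:`X^{(k)}`.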
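
Similarly, the permutation notation :math:`\pi_\sigma` can be illustrated with plain NumPy (again, only an illustration, not bgd code): for a transposition :math:`\tau_{i,j}`, it amounts to swapping two axes of an array.

.. code-block:: python

    import numpy as np

    X = np.arange(2 * 3 * 4).reshape(2, 3, 4)       # a rank-3 tensor

    # pi_{tau_{0,2}} X : the transposition tau_{0,2} swaps indices 0 and 2,
    # leaving the rank of the tensor unchanged.
    Y = np.swapaxes(X, 0, 2)
    print(Y.shape)                                  # (4, 3, 2)
    assert Y[1, 2, 0] == X[0, 2, 1]                 # entries follow the swapped indices

    # On a matrix (rank-2 tensor), pi_{tau_{0,1}} is the usual transpose.
    M = np.arange(6).reshape(2, 3)
    assert np.array_equal(np.swapaxes(M, 0, 1), M.T)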