Introduction and Notations

Dataset

From now on, we assume that \(\mathcal D = \{(x^{(i)}, y^{(i)})\}_{i=1}^N\) is the dataset, such that:

\[\forall i \in \{1, \ldots, N\} : x^{(i)} \in \mathbb R^n \text{ and } y^{(i)} \in \mathbb R^m.\]
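As a concrete illustration, the sketch below builds a toy dataset with exactly these shapes using NumPy (the names `X_data`, `Y_data` and `D` are placeholders for this example, not BGD objects):

```python
import numpy as np

# Toy dataset matching the notation above: N samples, inputs in R^n, targets in R^m.
N, n, m = 100, 4, 2
rng = np.random.default_rng(0)

X_data = rng.normal(size=(N, n))   # the x^(i), stacked as rows
Y_data = rng.normal(size=(N, m))   # the y^(i), stacked as rows

# D is the list of (x^(i), y^(i)) pairs.
D = list(zip(X_data, Y_data))
assert D[0][0].shape == (n,) and D[0][1].shape == (m,)
```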

Let \(\mathcal N\) denote the neural net function:

\[\mathcal N : \mathbb R^n \to \mathbb R^m : x \mapsto \mathcal N(x).\]

For \(X \in \mathbb R^{\ell \times n}\) a matrix of samples (each row is a sample in \(\mathcal D\)), we naturally extend \(\mathcal N\) such that:

\[\mathcal N(X) = [\mathcal N(X_i)]_{i=1}^\ell \in \mathbb R^{\ell \times m}.\]
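For instance, a naive row-wise implementation of this extension could look like the sketch below (`apply_batched` and `dummy_net` are illustrative names, not BGD functions); in practice, layers are usually vectorized so that the whole batch is processed at once:

```python
import numpy as np

def apply_batched(net, X):
    """Row-wise extension of a single-sample network `net` to a batch X of shape (l, n)."""
    return np.stack([net(x) for x in X], axis=0)   # shape (l, m)

# Dummy "network" mapping R^3 -> R^2 (illustrative only).
dummy_net = lambda x: np.array([x.sum(), x.prod()])
X = np.arange(6, dtype=float).reshape(2, 3)        # l = 2 samples, n = 3
print(apply_batched(dummy_net, X).shape)           # (2, 2)
```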

Neural Stack

A linear neural network (LNN for short, also called a neural stack, see bgd.nn.NeuralStack) is the composition of a sequence of layers. Let us denote the number of layers by \(K \in \mathbb N^*\). For \(k \in \{1, \ldots, K\}\), the \(k\)-th layer is denoted by \(\Lambda^{(k)}\) such that:

\[\Lambda^{(k)}_{\theta^{(k)}} : \mathbb R^{n^{(k-1)}} \to \mathbb R^{n^{(k)}},\]

where \(\theta^{(k)}\) denotes the parameters of layer \(k\): a tensor of rank \(r^{(k)} \in \mathbb N\) (0 for a non-parametrized layer) and of dimensions \(\delta^{(k)} \in (\mathbb N^*)^{r^{(k)}}\), such that:

\[\theta^{(k)} \in \mathbb R^{\delta^{(k)}} := \mathbb R^{\prod_{i=1}^{r^{(k)}}\delta^{(k)}_i},\]

and \(n^{(j)}\) denotes the dimension of the (tensor) output of the \(j\)-th layer. For the sake of notation, we also introduce \(n^{(0)} := n\).
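For example, a fully connected layer mapping \(\mathbb R^{n^{(k-1)}}\) to \(\mathbb R^{n^{(k)}}\) would carry a weight tensor of rank 2 and shape \((n^{(k-1)}, n^{(k)})\), while an activation layer carries no parameters at all (rank 0). The sketch below merely illustrates these shapes and is not tied to BGD's internal layout:

```python
import numpy as np

n_prev, n_next = 4, 3                      # n^(k-1) and n^(k)
theta_dense = np.zeros((n_prev, n_next))   # weight tensor: rank r^(k) = 2, delta^(k) = (4, 3)
theta_bias = np.zeros((n_next,))           # bias tensor: rank 1, delta = (3,)
# A parameter-free layer (e.g. an activation) corresponds to r^(k) = 0.
print(theta_dense.ndim, theta_bias.ndim)   # 2 1
```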

The neural net is therefore defined as the composition of all the layers:

\[\mathcal N(X) := \Big(\bigcirc_{k=1}^K\Lambda_{\theta^{(k)}}^{(k)}\Big)(X),\]

where the composition must be read as \(\Lambda^{(K)}_{\theta^{(K)}} \circ \ldots \circ \Lambda^{(1)}_{\theta^{(1)}}\) so that dimensions stay consistent.
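A minimal sketch of such a composition, assuming simple callable layers (this is not the actual bgd.nn.NeuralStack implementation), could read:

```python
import numpy as np

class DenseLayer:
    """Toy affine layer Lambda_theta : R^{n_prev} -> R^{n_next} (illustrative, not bgd.nn)."""
    def __init__(self, n_prev, n_next, rng):
        self.W = rng.normal(size=(n_prev, n_next))
        self.b = np.zeros(n_next)

    def __call__(self, X):
        return X @ self.W + self.b      # acts on a batch of shape (l, n_prev)

class ReLULayer:
    """Parameter-free layer (rank-0 parameter tensor)."""
    def __call__(self, X):
        return np.maximum(X, 0.0)

def neural_stack(layers, X):
    """Apply Lambda^(K) o ... o Lambda^(1) to the batch X."""
    for layer in layers:                # k = 1, ..., K, in order
        X = layer(X)
    return X

rng = np.random.default_rng(0)
layers = [DenseLayer(4, 8, rng), ReLULayer(), DenseLayer(8, 2, rng)]
out = neural_stack(layers, rng.normal(size=(5, 4)))   # out has shape (5, 2)
```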

Miscellaneous Notations

We also write \(X \in \mathbb R^{\ell \times n}\) for a batch of inputs to the LNN.

For the sake of convenience, for \(k \in \{1, \ldots, K\}\), we write \(X^{(k)} := \Lambda_{\theta^{(k)}}^{(k)}(X^{(k-1)})\), with \(X^{(0)} := X\).
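These intermediate batches \(X^{(k)}\) are exactly what a forward pass would cache, e.g. for backpropagation. A hedged sketch (the function name is illustrative):

```python
def forward_with_cache(layers, X):
    """Return the list [X^(0), X^(1), ..., X^(K)] of intermediate batches."""
    cache = [X]                            # X^(0) := X
    for layer in layers:                   # k = 1, ..., K
        cache.append(layer(cache[-1]))     # X^(k) := Lambda^(k)(X^(k-1))
    return cache

# Reusing `layers` and `rng` from the previous sketch:
# cache = forward_with_cache(layers, rng.normal(size=(5, 4)))
# len(cache) == len(layers) + 1
```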

For \(X\) a rank \(n\) tensor and \(\sigma \in \mathfrak S_n\) a permutation on \(n\) elements, we introduce the following notation:

\[\pi_\sigma X := \pi_\sigma(X) := [X_{\sigma\alpha}]_\alpha,\]

an extension of matrix transposition: if \(\sigma = \tau_{i,j}\) is the transposition of \(i\) and \(j\), then \(\pi_\sigma X\) is still a rank-\(n\) tensor, but with indices \(i\) and \(j\) swapped.
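In NumPy terms, \(\pi_\sigma\) corresponds to np.transpose with an explicit axis permutation, as in the following sketch:

```python
import numpy as np

X = np.arange(24).reshape(2, 3, 4)          # a rank-3 tensor

# sigma = tau_{0,2}: swap the first and last indices.
pi_sigma_X = np.transpose(X, axes=(2, 1, 0))
print(pi_sigma_X.shape)                     # (4, 3, 2)

# For a rank-2 tensor, tau_{0,1} recovers the usual matrix transpose.
M = np.arange(6).reshape(2, 3)
assert np.array_equal(np.transpose(M, axes=(1, 0)), M.T)
```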

Cost Function

A cost function (or loss function; both terms are used interchangeably) is a function:

\[L : \mathbb R^{m} \times \mathbb R^m \to \mathbb R^+ : (y, \hat y) \mapsto L(y, \hat y)\]

that represents the error between the ground truth \(y\) and the prediction \(\hat y = \mathcal N(x)\). The training phase attempts to minimize the loss \(L\) by adjusting the parameters of each layer.
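For example, the squared error is one such cost function; the sketch below uses plain NumPy and does not necessarily match the exact conventions of bgd.cost:

```python
import numpy as np

def squared_error(y, y_hat):
    """L(y, y_hat) = ||y - y_hat||^2 for y, y_hat in R^m."""
    return float(np.sum((y - y_hat) ** 2))

y = np.array([1.0, 0.0])
y_hat = np.array([0.8, 0.1])
print(squared_error(y, y_hat))   # ~0.05
```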

We introduce the notation:

\[\mathcal L : \mathbb R^{\ell \times n} \to \mathbb R^+ : X \mapsto L(\mathbf y, \mathcal N(X))\]

for the loss computed on samples of the dataset \(\mathcal D\), where \(\mathbf y\) is the array of labels associated with the samples in \(X\) (i.e. \(\mathbf y = [y^{(i_j)}]_{j=1}^\ell\) for indices \(i_j \in \{1, \ldots, N\}\) such that \(X = [x^{(i_j)}]_{j=1}^\ell\)), and where \(L\) is implicitly extended to operate on batches.
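Concretely, evaluating \(\mathcal L\) amounts to running the network on the batch and comparing the output with the matching labels. A sketch, assuming the per-sample losses are averaged over the batch (this aggregation rule is an assumption, not something specified above):

```python
import numpy as np

def batch_loss(net, L, X, Y):
    """Aggregate L(y^(i_j), N(x^(i_j))) over the rows of X.
    Averaging over the batch is an assumption made for this sketch."""
    Y_hat = net(X)                                                    # shape (l, m)
    return float(np.mean([L(y, y_hat) for y, y_hat in zip(Y, Y_hat)]))
```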

The list of loss functions implemented in BGD can be found in the bgd.cost module.