Introduction and Notations
Dataset
Starting from here, we assume that \(\mathcal D = \{(x^{(i)}, y^{(i)})\}_{i=1}^N\) is the dataset, where each \(x^{(i)} \in \mathbb R^n\) is a sample and \(y^{(i)}\) is its associated label.
Let \(\mathcal N\) denote the neural net function, mapping a single sample \(x \in \mathbb R^n\) to its prediction \(\hat y = \mathcal N(x)\).
For \(X \in \mathbb R^{\ell \times n}\) a matrix of samples (each row is a sample in \(\mathcal D\)), we naturally extend \(\mathcal N\) so that it is applied row-wise to \(X\).
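Explicitly, writing \(X_{i,:}\) for the \(i\) th row of \(X\) (a notation used only in this formula), one natural way to state this extension is \(\mathcal N(X) := \begin{pmatrix} \mathcal N(X_{1,:}) \\ \vdots \\ \mathcal N(X_{\ell,:}) \end{pmatrix}\), i.e. the \(i\) th row of \(\mathcal N(X)\) is the prediction for the \(i\) th sample of the batch.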
Neural Stack
A linear neural network (LNN for short, also called neural stack; see bgd.nn.NeuralStack) is a composition of a list of layers. Let’s denote
the number of layers by \(K \in \mathbb N^*\). For \(k \in \{1, \ldots, K\}\),
the \(k\) th layer is denoted by \(\Lambda^{(k)}_{\theta^{(k)}}\), where \(\theta^{(k)}\) denotes the parameters of layer \(k\) and is a tensor of rank \(r^{(k)} \in \mathbb N\) (0 for a non-parametrized layer) and of dimensions \(\delta^{(k)} \in {\mathbb N^*}^{r^{(k)}}\).
\(n^{(j)}\) denotes the dimension of the (tensor) output of the \(j\) th layer, so that \(\Lambda^{(k)}_{\theta^{(k)}}\) maps an input of dimension \(n^{(k-1)}\) to an output of dimension \(n^{(k)}\). For the sake of notation, we also introduce \(n^{(0)} = n\).
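For instance (an illustration of the notation, not a specific layer of the library), a fully connected layer without bias whose parameters form a weight matrix \(W \in \mathbb R^{n^{(k)} \times n^{(k-1)}}\) has \(r^{(k)} = 2\) and \(\delta^{(k)} = (n^{(k)}, n^{(k-1)})\), whereas a non-parametrized layer such as an activation has \(r^{(k)} = 0\).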
The neural net is therefore defined as the composition of all the layers: \(\mathcal N = \Lambda^{(K)}_{\theta^{(K)}} \circ \ldots \circ \Lambda^{(1)}_{\theta^{(1)}}\), where the layers are applied from \(\Lambda^{(1)}\) to \(\Lambda^{(K)}\) so that dimensions stay consistent.
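To make the composition concrete, the following minimal NumPy sketch (purely illustrative, not the bgd.nn.NeuralStack API) applies a stack of layer callables in the order above:

    import numpy as np

    def dense(theta):
        # Fully connected layer: theta is a weight matrix of shape (n_in, n_out).
        return lambda X: X @ theta

    def relu(X):
        # Non-parametrized layer (r = 0): element-wise rectifier.
        return np.maximum(X, 0.0)

    # Hypothetical stack with K = 3 layers: dense -> relu -> dense.
    rng = np.random.default_rng(0)
    layers = [dense(rng.normal(size=(4, 8))), relu, dense(rng.normal(size=(8, 2)))]

    def forward(X, layers):
        # Apply Lambda^(K) o ... o Lambda^(1) to a batch X of shape (l, n).
        for layer in layers:
            X = layer(X)              # X^(k) = Lambda^(k)(X^(k-1))
        return X                      # X^(K), i.e. N(X)

    X = rng.normal(size=(5, 4))       # l = 5 samples, n = 4 features
    print(forward(X, layers).shape)   # (5, 2)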
Miscellaneous Notations
We also write \(X \in \mathbb R^{\ell \times n}\) for a batch input of the LNN.
For the sake of convenience, for \(k \in \{1, \ldots, K\}\), we write \(X^{(k)} := \Lambda_{\theta^{(k)}}^{(k)}(X^{(k-1)})\), with \(X^{(0)} := X\); in particular, \(X^{(K)} = \mathcal N(X)\).
For \(X\) a rank \(n\) tensor and \(\sigma \in \mathfrak S_n\) a permutation on \(n\) elements, we introduce the notation \(\pi_\sigma X\) for the tensor obtained by permuting the indices of \(X\) according to \(\sigma\). It is an extension of the transposition of matrices: if \(\sigma = \tau_{i,j}\) is the transposition of \(i\) and \(j\), then \(\pi_\sigma X\) is still a rank \(n\) tensor, but whose indices \(i\) and \(j\) have been swapped.
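A concrete analogue in NumPy (unrelated to the library's internals): numpy.transpose with an explicit axes permutation performs exactly this index permutation:

    import numpy as np

    X = np.arange(24).reshape(2, 3, 4)   # a rank-3 tensor

    sigma = (2, 1, 0)                    # the transposition tau_{0, 2}
    Y = np.transpose(X, axes=sigma)      # pi_sigma X: indices 0 and 2 swapped

    print(X.shape, Y.shape)              # (2, 3, 4) (4, 3, 2)
    print(X[1, 2, 3] == Y[3, 2, 1])      # True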
Cost Function
A cost function (or, equivalently, loss function) is a function \(L\) that represents the error between a ground truth \(y\) and an estimation \(\hat y = \mathcal N(x)\). The training phase attempts to reach the minimum of the loss \(L\) by adapting the parameters of each layer.
By a slight abuse of notation, we also write \(L(X, \mathbf y)\) for the loss computed on a batch of samples of the dataset \(\mathcal D\), where \(\mathbf y\) is the label vector associated with the samples \(X\) (i.e. \(\mathbf y = [y^{(i_j)}]_{j=1}^\ell\) for \(i_j \in \{1, \ldots, N\}\) such that \(X = [x^{(i_j)}]_{j=1}^\ell\)).
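For illustration, a mean squared error over a batch might look like the following NumPy sketch (not the implementation shipped in bgd.cost):

    import numpy as np

    def mse(y_true, y_pred):
        # Mean squared error between ground truth y and estimation y_hat,
        # averaged over the l samples of the batch.
        y_true = np.asarray(y_true, dtype=float)
        y_pred = np.asarray(y_pred, dtype=float)
        return np.mean((y_true - y_pred) ** 2)

    y     = np.array([[1.0], [0.0], [1.0]])   # labels of a batch of l = 3 samples
    y_hat = np.array([[0.9], [0.2], [0.7]])   # corresponding predictions N(x)
    print(mse(y, y_hat))                      # ~0.0467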
The list of loss functions implemented in BGD can be found in bgd.cost.