Today’s post reviews a recent talk given at the Simons Institute for the Theory of Computing in their current workshop series Computational Challenges in Machine Learning.

**tl;dr:** Sufficiently regular functions (roughly: invertible and
differentiable, with Lipschitz derivative and inverse) can be
represented as compositions of small perturbations of the identity
whose size decreases with depth. Furthermore, every critical point of
the quadratic loss for these target functions is proven to be a global
minimum, thus ensuring loss-reducing gradient descent steps. This
makes this class of functions “easily” approximable by Deep Residual
Networks.

Deep networks are deep compositions of non-linear functions:

\[ h = h_m \circ h_{m - 1} \circ \ldots \circ h_1 . \]

Depth provides effective, parsimonious representations of features,
and nonlinearities provide better rates of approximation (known from
approximation theory). Moreover, shallow representations of some
functions are necessarily more complex than deeper ones
(*no-flattening theorems*).^{1} But optimization is hard with many
layers: conventional DNNs show increasingly poor performance on the
same task as depth grows, even though deeper networks can approximate
a strictly larger set of functions.

**D**eep **R**esidual **N**etworks^{2} overcome this limitation by
introducing skip-paths connecting the input of layer $j$ to the output
of layer $j+1$:

\[ f_{j + 1} (x_j) = w_{j + 1} \sigma (w_j x_j) + x_j, \]

where $\sigma$ is typically a ReLU.^{3}
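
A minimal numpy sketch of this residual unit (the weights, dimension, and inputs are illustrative, not from the talk). With zero weights the layer is exactly the identity; with small weights it is a small perturbation of it:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_layer(x, w_in, w_out):
    # f(x) = w_out @ relu(w_in @ x) + x : the skip-path passes x through unchanged
    return w_out @ relu(w_in @ x) + x

rng = np.random.default_rng(0)
d = 4
x = rng.normal(size=d)

# zero weights: the layer computes exactly the identity
y = residual_layer(x, np.zeros((d, d)), np.zeros((d, d)))
print(np.allclose(y, x))  # True

# small weights: the layer is a small perturbation of the identity
w_in = 0.1 * rng.normal(size=(d, d))
w_out = 0.1 * rng.normal(size=(d, d))
print(np.linalg.norm(residual_layer(x, w_in, w_out) - x))  # small
```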

**Why?** First consider the following fact about linear maps:

Theorem 1:^{4} Any $A$ with $\det A > 0$ can be written as a product of perturbations of the identity with decreasing norm, i.e.

\[ A = (I + A_m) \cdots (I + A_1), \]

with the spectral norm of each perturbation satisfying $| A_i | = \mathcal{O} (1 / m)$.
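
This is easy to check numerically for a symmetric positive-definite $A$, where the $m$-th root is explicit via an eigendecomposition (an assumption made here to keep the demo simple; the general $\det A > 0$ case is subtler):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 50, 4

# a symmetric positive-definite A (in particular det A > 0)
B = rng.normal(size=(d, d))
A = B @ B.T + d * np.eye(d)

# m-th root of A via its eigendecomposition: A^(1/m) = I + A_i
w, V = np.linalg.eigh(A)
root = (V * w ** (1.0 / m)) @ V.T
A_i = root - np.eye(d)

print(np.linalg.norm(A_i, 2))  # spectral norm of the perturbation, O(1/m)
# the product of m identical near-identity factors recovers A
print(np.allclose(np.linalg.matrix_power(root, m), A))  # True
```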

Consider now a linear Gaussian model $y = Ax + \varepsilon$ with $\varepsilon \sim \mathcal{N} (0, \sigma^2 I)$. If one sets out to find $A_i$ which minimize the quadratic loss

\[ R (A_1, \ldots, A_m) = \mathbb{E} | (I + A_m) \cdots (I + A_1) x - y |^2, \]

over all $A_{i}$ such that $I + A_{i}$ is near the identity,
i.e. among all $|A_{i} | \ll 1$ in some suitable norm, then it can
be shown that *every stationary point is a global optimum*: if one has
$\nabla _{A_1, \ldots, A_m} R (A^{\star}_{1}, \ldots,
A_{m}^{\star}) = 0$, then one has found the true $A = (I +
A^{\star}_{m}) \cdots (I + A^{\star}_{1})$ (according to the
model). Note that this is a property of stationary points in this
region and does not say that one can attain these points with some
particular discretization of gradient descent, i.e. this is neither a
generalization bound nor a global result.
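
A quick numpy experiment illustrating this for $m = 2$ (the target, step size, and iteration count are illustrative choices): for isotropic inputs the population loss reduces to a Frobenius-norm objective, and plain gradient descent on the factors drives the residual of the product to zero.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
# target close to the identity, so the factors can stay in the near-identity region
A = np.eye(d) + 0.1 * rng.normal(size=(d, d))

A1 = np.zeros((d, d))
A2 = np.zeros((d, d))
lr = 0.1
for _ in range(2000):
    # residual of the factorization; for isotropic x the risk is
    # proportional to ||(I + A2)(I + A1) - A||_F^2
    P = (np.eye(d) + A2) @ (np.eye(d) + A1) - A
    g1 = (np.eye(d) + A2).T @ P   # gradient w.r.t. A1
    g2 = P @ (np.eye(d) + A1).T   # gradient w.r.t. A2
    A1 = A1 - lr * g1
    A2 = A2 - lr * g2

residual = (np.eye(d) + A2) @ (np.eye(d) + A1) - A
print(np.linalg.norm(residual))  # ≈ 0: the stationary point reached is global
```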

The goal of the talk is to show that similar statements hold in the
non-linear case as well. The **main result** is (roughly):

Theorem 2: The computation of a “smoothly invertible map” $h$ can be spread throughout a deep network

\[ h = h_m \circ h_{m - 1} \circ \ldots \circ h_1, \]

so that all layers compute near-identity functions:

\[ | h_i - Id |_L = \mathcal{O} \left( \frac{\log m}{m} \right), \]

where the $| f |_L$ semi-norm is the optimal Lipschitz constant of $f$ and “smoothly invertible” means invertible and differentiable, with Lipschitz derivative and inverse.

How does this relate to DRNs? Recall that a standard unit of the kind $A_i \sigma (B_i x)$ is known to be a universal approximator, so any of the $h_i$ in the decomposition of Theorem 2 can be approximated by DRN units of the form

\[ \hat{h}_i(x) = x + A_i \sigma (B_i x), \]

with the perturbations “becoming flatter and flatter” with $i$. This means that DRNs allow for compositional representations of functions whose terms are increasingly “flat” as one goes deeper. Note, however, that this is only proved for functions which are invertible and differentiable, with Lipschitz derivative and inverse.

The functions $h_i$ of the theorem can be explicitly constructed via adequate scalings $a_1, \dots, a_m \in \mathbb{R}$ such that:

\[ h_1 (x) = h (a_1 x) / a_1, \quad (h_2 \circ h_1) (x) = h (a_2 x) / a_2, \quad \ldots, \quad (h_m \circ \cdots \circ h_1) (x) = h (a_m x) / a_m, \]

and the $a_i$ small enough that $h_i \approx Id$.
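
As a sanity check, here is a numerical sketch of this construction for a toy scalar map, $h(x) = (x + \tanh x)/2$ (chosen so that $h'(0) = 1$), with scalings $a_i = i/m$ so that the last stage recovers $h$ itself; both the map and the scalings are illustrative assumptions for the demo, not the talk's exact choices:

```python
import numpy as np

def h(x):
    # illustrative smoothly invertible scalar map with h'(0) = 1
    return 0.5 * (x + np.tanh(x))

def dh(x):
    return 0.5 * (1.0 + 1.0 / np.cosh(x) ** 2)

def g(x, a):
    # scaled map g_a(x) = h(a x) / a; g_a -> Id as a -> 0 because h'(0) = 1
    return h(a * x) / a

def g_inv(y, a, iters=50):
    # invert the monotone map g_a by Newton's method (h' >= 1/2 everywhere)
    z = np.array(y, dtype=float)
    for _ in range(iters):
        z -= (h(a * z) - a * y) / (a * dh(a * z))
    return z

m = 20
a = np.arange(1, m + 1) / m        # scalings a_1 < ... < a_m = 1
x = np.linspace(-2.0, 2.0, 401)

# apply h_i = g_{a_i} o g_{a_{i-1}}^{-1} layer by layer (with g_{a_0} := Id)
y = x.copy()
max_dev = 0.0
for i in range(m):
    z = y if i == 0 else g_inv(y, a[i - 1])
    y_new = g(z, a[i])
    # crude grid estimate of how far layer i is from the identity
    max_dev = max(max_dev, float(np.max(np.abs(y_new - y))))
    y = y_new

print(np.max(np.abs(y - h(x))))  # ≈ 0: the composition recovers h
print(max_dev)                   # each layer stays uniformly close to Id
```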

Analogously to the linear case, for the class of functions which may
be represented as such nested, near-identity compositions of maps,
**stationary points of the risk function**

\[ Q (h) = \frac{1}{2} \mathbb{E} | h (X) - Y |_2^2 \]

are global minima.^{5} Then

Theorem 3: For any function

\[ h = h_m \circ h_{m - 1} \circ \ldots \circ h_1 \]

where $| h_i - Id |_L \leqslant \varepsilon < 1$, it holds for all $i$ that:

\[ | D_{h_i} Q (h) | \geqslant \frac{(1 - \varepsilon)^{m - 1}}{| h - h^{\ast} |} (Q (h) - Q (h^{\ast})) . \]

This means that if we start with any $h$ in this class of functions
near the identity which is *suboptimal* (i.e. $Q (h) - Q (h^{\ast}) >
0$), then the (Fréchet) gradient is bounded away from zero, so a
gradient descent step can be taken that improves the risk.

Note that this is in the whole space of such nested functions, not in the particular parameter space of some instance $\hat{h}$: it can happen that the optimal direction in the whole space is “orthogonal” to the whole subspace allowed by changing weights in the layers of $\hat{h}$. Or it can (and will!) happen that there are local minima among all possible parametrizations of any layer $\hat{h}$. The following statement remains a bit unclear to me:

> We should expect suboptimal stationary points in the ReLU or sigmoid parameter space, but these cannot arise because of interactions between parameters in different layers; they arise only within a layer.

Basically: if we are able to optimize in the space of architectures, we should always be able to improve performance (assuming invertible $h$ with Lipschitz derivative and so on).

1. See e.g. *Why does deep and cheap learning work so well?*
2. *Deep Residual Learning for Image Recognition*.
3. *Deep sparse rectifier neural networks*.
4. *Identity matters in Deep Learning*.
5. Recall that the minimizer of the risk with quadratic loss is the $L^2$ projection, i.e. the conditional expectation $h^{\ast} (x) = \mathbb{E} [Y|X = x]$.