This is PaperWhy. Our sisyphean endeavour not to drown in the immense Machine Learning literature.

With thousands of papers published every month, keeping up with and making sense of recent research in machine learning has become almost impossible. By routinely reviewing and reporting on papers, we help ourselves and hopefully someone else.

Spectral Clustering based on Local PCA

Actually appeared in 2011.

tl;dr This paper develops an algorithm for manifold clustering1, Connected Component Extraction (CCE), which attempts to resolve the issue of intersecting manifolds. The idea is to use a local version of PCA at each point to determine the principal directions of an “approximate tangent space” at that point, and from these to compute a set of weights for neighboring points. These weights are then used to build a graph, and Spectral Graph Partitioning2 (SGP) is applied to compute its connected components.


Because the goal is to resolve intersections, it is necessary to assume that the input data is distributed around smooth, smoothly intersecting manifolds.3

Three variants of CCE are presented: comparing local covariances, comparing projections onto the eigenvectors of the covariance matrix, and using local PCA.

Highlights

  • In their comparison of the methods to prior work, the authors review multiple affinities and their relative weaknesses.
  • There is guaranteed clustering for CCE using local covariances and local projections (but not local PCA), assuming that:
    • the data has been sampled uniformly from smooth $d$-dimensional manifolds with bounded additive noise.4
    • the right (hyper-) parameters have been set. Note that these values depend on the geometric configuration but in principle cross-validation should work.
  • Detailed and self-contained proofs are given: the authors provide a characterization of the covariance matrix for data as above (uniformly sampled, additive noise), estimates of relevant geometric quantities, bounds on distances of covariance functions (and those of their push-forwards), as well as a form of Hoeffding’s inequality for matrices, among other things.
  • Better performance than … Check the paper for detailed examples.

Method

All three variants begin by subsampling the dataset, then computing the local sample covariance matrix at each subsampled point.

The first variant then discards points using a heuristic intended to approximate the condition of “being close to an intersection”, and builds an unweighted graph with edges between points $x_i, x_j$ which are both close enough and have similar enough covariance matrices, i.e. $| C_i - C_j | < \eta r^2$. This is encoded in the affinity:

\[ W_{i j} = \mathbb{1}_{ \lbrace | x_i - x_j | < \varepsilon \rbrace } \mathbb{1}_{ \lbrace | C_i - C_j | < \eta r^2 \rbrace }. \]

Here the matrix norm is the spectral norm: intuitively, the smaller $| C_i - C_j |$ is, the “more parallel” the covariance matrices will be.

This graph is then fed to SGP to compute the connected components and the data is clustered.
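
For concreteness, here is a minimal Python sketch of this first variant (an illustration under assumptions, not the authors’ code): it assumes the subsampling and the intersection heuristic have already been applied, computes the local covariances by brute-force neighbor search, and extracts the connected components of the resulting graph directly instead of going through SGP.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def local_covariances(X, r):
    """Sample covariance of the r-neighborhood of each point (brute force)."""
    return np.array([np.cov(X[np.linalg.norm(X - x, axis=1) < r], rowvar=False)
                     for x in X])

def covariance_affinity(X, C, eps, eta, r):
    """Unweighted graph: edge iff points are close and covariances are similar."""
    n = len(X)
    W = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            close = np.linalg.norm(X[i] - X[j]) < eps
            similar = np.linalg.norm(C[i] - C[j], ord=2) < eta * r**2  # spectral norm
            W[i, j] = W[j, i] = int(close and similar)
    return W

def cluster_by_covariances(X, r, eps, eta):
    C = local_covariances(X, r)
    W = covariance_affinity(X, C, eps, eta, r)
    # The paper runs SGP on W; extracting connected components is the simplest stand-in.
    _, labels = connected_components(csr_matrix(W), directed=False)
    return labels
```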

The second method, local projections, skips the heuristic and instead computes the projections $Q_i$ onto the principal eigenvectors of the $C_i$ (those with eigenvalues $> \sqrt{\eta} \, | C_i |$), and uses the affinity

\[ W_{i j} = \mathbb{1}_{ \lbrace | x_i - x_j | < \varepsilon \rbrace } \mathbb{1}_{ \lbrace | Q_i - Q_j | < \eta \rbrace }. \]
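
One possible reading of the thresholding step, as a short Python sketch (the precise rule here is my assumption; see the paper for the exact definition):

```python
import numpy as np

def local_projection(C_i, eta):
    """Projection onto the eigenvectors of C_i whose eigenvalues exceed sqrt(eta) * |C_i|."""
    vals, vecs = np.linalg.eigh(C_i)        # eigenvalues in ascending order
    keep = vals > np.sqrt(eta) * vals[-1]   # |C_i| (spectral norm) is the largest eigenvalue
    V = vecs[:, keep]
    return V @ V.T                          # orthogonal projection Q_i
```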

Here is the third and best-performing method, using local PCA, in detail (a code sketch follows the list):

  1. Create a subsample of adequately spaced data points $S = \lbrace x_1, \ldots, x_{n_0} \rbrace$, which are the centers of disjoint balls of radius $r > 0$, a fixed hyperparameter.
  2. For each $x_i \in S$, perform local PCA in the neighborhood $N_r (x_i)$ to compute the principal directions of an approximate tangent space at $x_i$.5 Let $Q_i$ be the projection onto the top $d$ eigenvectors of $C_i$. The local dimension $d$ is another (essential) hyperparameter, but the gaps in the eigenvalues can provide good candidate values for the cross-validation.
  3. For each $x_i, x_j \in S$ compute the affinities \[ W_{i j} = \exp \left( - \frac{| x_i - x_j |^2}{\varepsilon^2} \right) \exp \left( - \frac{| Q_i - Q_j |^2}{\eta^2} \right). \] Intuitively, the smaller $| Q_i - Q_j |$ is, the “more parallel” the tangent spaces are and the closer to 1 the second exponential will be. Conversely, a large norm pushes the affinity towards 0.
  4. Use SGP on $W$ to cluster the centers, then assign every data point to the cluster of its closest center.
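
Below is a minimal end-to-end Python sketch of this variant. It is an illustration under assumptions, not the authors’ implementation: neighbor search is brute force, the disjoint balls are picked greedily, and scikit-learn’s SpectralClustering with a precomputed affinity stands in for the SGP step.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cce_local_pca(X, r, d, eps, eta, n_clusters):
    # 1. Greedily pick centers of disjoint balls of radius r.
    centers = []
    for i in range(len(X)):
        if all(np.linalg.norm(X[i] - X[j]) > 2 * r for j in centers):
            centers.append(i)
    S = X[centers]

    # 2. Local PCA: projection onto the top-d eigenvectors of each local covariance.
    Q = []
    for x in S:
        nbrs = X[np.linalg.norm(X - x, axis=1) < r]
        vals, vecs = np.linalg.eigh(np.cov(nbrs, rowvar=False))
        Q.append(vecs[:, -d:] @ vecs[:, -d:].T)   # approximate tangent-space projection

    # 3. Affinities between centers: proximity times tangent-space alignment.
    n0 = len(S)
    W = np.zeros((n0, n0))
    for i in range(n0):
        for j in range(n0):
            W[i, j] = (np.exp(-np.linalg.norm(S[i] - S[j])**2 / eps**2) *
                       np.exp(-np.linalg.norm(Q[i] - Q[j], ord=2)**2 / eta**2))

    # 4. Spectral partitioning of the centers, then assign every data point to
    #    the cluster of its closest center.
    labels_S = SpectralClustering(n_clusters=n_clusters,
                                  affinity='precomputed').fit_predict(W)
    closest = np.argmin(np.linalg.norm(X[:, None] - S[None], axis=2), axis=1)
    return labels_S[closest]
```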

Drawbacks

  • Assumes a known number of clusters (required for SGP).
  • Assumes the dimension of the manifolds is known. Estimates using the jump in eigenvalues after PCA (see the sketch after this list) didn’t perform well.
  • Assumes smooth manifolds intersect at non-zero angles (but consider the graphs of $x^2$ and $x^3$, which intersect at 0 with a common tangent).6
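
Regarding the second point, a naive version of the eigenvalue-jump estimate could look as follows (a hypothetical heuristic for illustration, not the estimator evaluated in the paper):

```python
import numpy as np

def estimate_local_dim(C_i):
    """Guess the local dimension from the position of the largest eigenvalue gap of C_i."""
    vals = np.sort(np.linalg.eigvalsh(C_i))[::-1]   # eigenvalues, descending
    gaps = vals[:-1] - vals[1:]
    return int(np.argmax(gaps)) + 1
```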

  1. See e.g. the review Subspace Clustering, Vidal, R. (2011)
  2. On Spectral Clustering: Analysis and an algorithm, Ng, A., Jordan, M., Weiss, Y. (2002)
  3. Manifold clustering is particularly relevant in motion segmentation and some specific cases of face recognition. But what has been the effect of the development of deep convolutional neural networks on the relevance of these techniques?
  4. Check me: Is this essential for the proofs?
  5. Local PCA amounts to an eigenvalue decomposition of the local covariance matrix $C_i$.
  6. Maybe one could use some idea of momentum as one moves through the manifold? Assuming enough regularity it should be possible to determine when a “trajectory” escapes the manifold.