Lecture 22: SPECTRAL GRAPH CLUSTERING
=====================================
Input: Weighted, undirected graph G = (V, E). No self-edges. w_ij = weight of edge (i, j) = (j, i); zero if (i, j) not in E.
[Think of the edge weights as a similarity measure. A big weight means that the two vertices want to be in the same cluster. So the circumstances are the opposite of the last lecture on clustering. Then, we had a distance or dissimilarity function, so small numbers meant that points wanted to stay together. Today, big numbers mean that vertices want to stay together.]
Goal: Cut G into 2 (or more) pieces G_i of similar sizes, but don't cut too much edge weight. [That's a vague goal. There are many ways to make this precise. Here's a typical goal, which we'll solve approximately.]

    e.g., minimize the _sparsity_, aka _cut_ratio_:

                 Cut(G_1, G_2)
            ---------------------
             Mass(G_1) Mass(G_2)

    where Cut(G_1, G_2) = total weight of cut edges,
          Mass(G_1)     = # of vertices in G_1, OR assign masses to vertices.
[The denominator "Mass(G_1) Mass(G_2)" penalizes imbalanced cuts.]
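[A minimal sketch of this objective in Python; it is not from the lecture. The names W (a symmetric weight matrix with zero diagonal), y (a +1/-1 indicator vector), and sparsity are my own, and the vertex masses are taken to be 1.]

    import numpy as np

    def sparsity(W, y):
        """Cut(G_1, G_2) / (Mass(G_1) Mass(G_2)) with unit vertex masses."""
        in_g1 = y > 0
        # Each cut edge (i, j) appears exactly once in this submatrix,
        # with i in G_1 and j in G_2, so there is no double counting.
        cut = W[np.ix_(in_g1, ~in_g1)].sum()
        return cut / (in_g1.sum() * (~in_g1).sum())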
[Show illustration of four cuts (graph.pdf). All edges have weight 1. Upper left: the _minimum_bisection_; a _bisection_ is perfectly balanced. Upper right: the _minimum_cut_. Usually very unbalanced; not what we want. Lower left: the _sparsest_cut_, which is good for many applications. Lower right: the _maximum_cut_; in this case also the maximum bisection.] Sparsest cut is NP-hard. [We will look for an approximate solution.]
[We will turn this combinatorial graph cutting problem into algebra.]
Let n = |V|. Let y in R^n be an _indicator_vector_:

    y_i =  1   if vertex i is in G_1,
    y_i = -1   if vertex i is in G_2.

    Then  (w_ij / 4) (y_i - y_j)^2  =  w_ij   if (i, j) is cut,
                                    =  0      if (i, j) is not cut.

    Cut(G_1, G_2)
      = sum_{(i,j) in E}  (w_ij / 4) (y_i - y_j)^2
      = (1/4) sum_{(i,j) in E}  (w_ij y_i^2 - 2 w_ij y_i y_j + w_ij y_j^2)
      = (1/4) [ sum_{(i,j) in E} -2 w_ij y_i y_j  +  sum_{i=1}^n y_i^2 sum_{k != i} w_ik ]
                \______ off-diagonal terms ______/   \________ diagonal terms ________/
      = (y^T L y) / 4,

    where  L_ij = -w_ij              for i != j,
           L_ii = sum_{k != i} w_ik.

    Example:      (1)              -            -
                 1/   \3           |  4  -1  -3 |
                 /     \       L = | -1   6  -5 |
               (2)-----(3)         | -3  -5   8 |
                   5               -            -

L is the symmetric, n-by-n _Laplacian_matrix_ for G.
[L is effectively a matrix representation of G. For the purpose of partitioning a graph, there is no need to distinguish edges of weight zero from edges that are not in the graph.] [We see that minimizing the weight of the cut is equivalent to minimizing the Laplacian_quadratic_form y^T L y. This lets us turn graph partitioning into a problem in matrix algebra.] [Usually we assume there are no negative weights, in which case Cut(G_1, G_2) can never be negative, so it follows that L is positive semidefinite.]
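[A quick numerical check of the identity Cut(G_1, G_2) = y^T L y / 4 on the triangle example above; the variable names are mine.]

    import numpy as np

    W = np.array([[0., 1., 3.],       # w_12 = 1, w_13 = 3, w_23 = 5
                  [1., 0., 5.],
                  [3., 5., 0.]])
    L = np.diag(W.sum(axis=1)) - W    # L_ii = sum_k w_ik; L_ij = -w_ij
    y = np.array([1., -1., -1.])      # cut vertex 1 away from {2, 3}
    print(y @ L @ y / 4)              # 4.0 = w_12 + w_13, the cut weight
    print(L @ np.ones(3))             # [0. 0. 0.]; foreshadows the next point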
If y = *1* = [ 1 1 ... 1 ]^T, then Cut(G_1, G_2) = 0. [*1* = bold 1.] Moreover, every row of L sums to zero, so L *1* = 0; that is, *1* is an eigenvector of L with eigenvalue 0.
[If G is connected and all the edge weights are positive, then this is the only zero eigenvalue. But if G is not connected, L has one zero eigenvalue for each connected component of G. It's easy to prove, but time prevents me.]
Bisection: exactly n/2 vertices in G_1, n/2 in G_2. Write *1*^T y = 0. [So we have reduced graph bisection to this constrained optimization problem.]
    ---------------------------------------------
    | Find y that minimizes  y^T L y            |
    |                                           |
    | subject to  y_i = 1 or y_i = -1 for all i |  <- _binary_constraint_
    |                                           |
    | and         *1*^T y = 0                   |  <- _balance_constraint_
    ---------------------------------------------

Also NP-hard. We _relax_ the binary constraint. -> fractional vertices!
[A very common approach in combinatorial optimization algorithms is to relax some of the constraints so a discrete problem becomes a continuous problem. Intuitively, this means that you can put 1/3 of vertex 7 in graph G_1 and the other 2/3 of vertex 7 in graph G_2. You can even put -1/2 of vertex 7 in graph G_1 and 3/2 of vertex 7 in graph G_2. This sounds crazy, but the continuous problem is much easier to solve than the combinatorial problem. After we solve it, we will round the vertex values to +1/-1, and we'll hope that our solution is still close to optimal.] [We can't just drop the binary constraint, though. We still need some constraint to rule out the solution y = 0.]
New constraint: y must lie on sphere of radius sqrt(n). [Draw figure showing constraint before--y lies at a vertex of the hypercube-- and the constraint after--y lies on the sphere through those vertices.]
Relaxed problem:

    ----------------------------
    | Minimize    y^T L y      |             y^T L y
    |                          |  = Minimize -------  (the Rayleigh quotient of L & y)
    | subject to  y^T y = n    |              y^T y
    |                          |
    | and         *1*^T y = 0  |
    ----------------------------
[Show illustration of isosurfaces of y^T Ly (cylinder.pdf) and illustration restricted to the hyperplane 1^T y = 0 (endview.pdf).]
[You should remember this Rayleigh quotient from the lecture on PCA. As I said then, when you see a Rayleigh quotient, you should smell eigenvalues nearby. The y that minimizes this Rayleigh quotient is the eigenvector with the smallest eigenvalue. We already know what that eigenvector is: it's 1. But that violates our balance constraint. As you should recall from PCA, when you've used the most extreme eigenvector and you need an orthogonal one, the next-best optimizer of the Rayleigh quotient is the next eigenvector.]
Let lambda_2 = the second-smallest eigenvalue of L. Its eigenvector v_2 is the _Fiedler_vector_. [It would be wonderful if every component of the Fiedler vector were 1 or -1, but that more or less never happens. So we round it. The simplest way is to round all positive entries to 1 and all negative entries to -1. But in both theory and practice, it's better to choose the threshold as follows.]
Spectral partitioning alg.:
- Compute the Fiedler vector v_2 of L.
- _Round_ v_2 with a _sweep_cut_:
  * Sort the components of v_2.
  * Try the n - 1 cuts between successive components. Choose the minimum-sparsity cut.
    [If we're clever about it, we can try all these cuts in time linear in the number of edges in G.]
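[A sketch of the whole algorithm in Python, under my own naming assumptions: dense numpy, unit vertex masses, and sparsity as defined earlier. This version evaluates each sweep cut from scratch, so it is quadratic rather than the linear time mentioned above.]

    import numpy as np

    def spectral_partition(W):
        """Return a +1/-1 labeling chosen by a sweep cut over the Fiedler vector."""
        n = len(W)
        L = np.diag(W.sum(axis=1)) - W
        _, vecs = np.linalg.eigh(L)              # eigenvalues in ascending order
        order = np.argsort(vecs[:, 1])           # sort components of Fiedler vector v_2
        best_y, best_sparsity = None, np.inf
        for split in range(1, n):                # the n - 1 cuts between components
            y = -np.ones(n)
            y[order[:split]] = 1                 # first `split` sorted vertices -> G_1
            in_g1 = y > 0
            cut = W[np.ix_(in_g1, ~in_g1)].sum()
            s = cut / (split * (n - split))      # sparsity with unit masses
            if s < best_sparsity:
                best_y, best_sparsity = y, s
        return best_y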
[Show example of graph partitioned by the sweep cut (specgraph.pdf).] [Show what the un-rounded Fiedler vector looks like (specvector.pdf).] [One consequence of relaxing the binary constraint is that the balance constraint no longer forces an exact bisection. But that's okay; we're cool with a slightly off-balance cut if it means we cut fewer edges.] [Show illustration where an off-balance cut is better (lopsided.pdf).]
Fact: The sweep cut finds a cut with sparsity <= sqrt(2 lambda_2 max_i (L_ii / M_ii));
      this bound is _Cheeger's_inequality_.
      The optimal cut has sparsity >= lambda_2 / 2.
[So the spectral partitioning algorithm is an approximation algorithm, albeit not one with a constant factor of approximation. Cheeger's inequality is a very famous result in spectral graph theory, because it's one of the most important cases where you can relax a combinatorial optimization problem to a continuous optimization problem, round the solution, and still have a provably decent solution to the original combinatorial problem.]
Vertex Masses
[Sometimes you want the notion of balance to accord more prominence to some vertices than others. We can assign masses to vertices.]

Let M be the diagonal matrix with the vertex masses on its diagonal.

New balance constraint:   *1*^T M y = 0.
[This new balance constraint says that G_1 and G_2 should each have the same total mass. It turns out that this new balance constraint is easier to satisfy if we also revise the sphere constraint a little bit.]

New ellipsoid constraint: y^T M y = Mass(G) = sum_i M_ii.
[Instead of a sphere, now we constrain y to lie on an axis-aligned ellipsoid.]
[Draw the ellipsoid, which passes through the points of the hypercube.]
Now we want the Fiedler vector of the _generalized_eigensystem_ L v = lambda M v. [Most algorithms for computing eigenvectors and eigenvalues of symmetric matrices can easily be adapted to compute eigenvectors and eigenvalues of symmetric generalized eigensystems like this too.]
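[A sketch of the mass-weighted variant; scipy's dense symmetric solver accepts a second matrix and solves the generalized problem directly. The function name and the `masses` array are assumptions for illustration.]

    import numpy as np
    from scipy.linalg import eigh

    def mass_weighted_fiedler(W, masses):
        L = np.diag(W.sum(axis=1)) - W
        M = np.diag(masses)
        # Solves L v = lambda M v; eigenvalues come back in ascending order.
        _, vecs = eigh(L, M)
        return vecs[:, 1]              # generalized Fiedler vector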
Vibration Analogy
[Show figure of system of springs and masses (vibrate.pdf).] [For intuition about spectral partitioning, think of the eigenvectors as vibrational modes in a physical system of springs and masses. Each vertex models a point mass that is constrained to move freely along a vertical rod. Each edge models a vertical spring with rest length zero and stiffness proportional to its weight, pulling two point masses together. The masses are free to oscillate sinusoidally on their rods. The eigenvectors of the generalized eigensystem Lv = lambda Mv are the vibrational modes of this physical system, and their eigenvalues are proportional to their frequencies.]
[Show figure of vibrational modes in path graph and grid graph (grids.pdf).] [These illustrations show the first four eigenvectors for two simple graphs. On the left, we see that the first eigenvector is the eigenvector of all 1's, which represents a vertical translation of all the masses in unison. That's not really a vibration, which is why the eigenvalue is zero. The second eigenvector is the Fiedler vector, which represents the vibrational mode with the lowest frequency. Each component indicates the amplitude with which the corresponding point mass oscillates. At any point in time as the masses vibrate, roughly half the mass is moving up while half is moving down. So it makes sense to cut between the positive components and the negative components. The third eigenvector also gives us a nice bisection of the grid graph, entirely different from the Fiedler vector. Some more sophisticated graph clustering algorithms use multiple eigenvectors.]
[I want to emphasize that spectral partitioning takes a global view of a graph. It looks at the whole gestalt of the graph and finds a good cut. By comparison, the clustering algorithms we saw last lecture were much more local in nature, so they're easier to fool.]
Greedy Divisive Clustering
Partition G into 2 subgraphs; recursively cluster them. [The sparsity is a good criterion for graph clustering. Use G's sparsest cut to divide it into two subgraphs, then recursively cut them. You can stop when you have the right number of clusters, or you could keep going until each subgraph is a single vertex and create a dendrogram.] Can form a dendrogram, but it may have inversions. [There's no reason to expect that the sparsity of a subgraph is smaller than the sparsity of the parent graph, so the dendrogram can have inversions. But it's still useful for getting an arbitrary number of clusters on demand.]
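[A sketch of the recursion, reusing spectral_partition from the earlier sketch. The stopping rule (stop at k clusters) and the split-the-largest policy are my assumptions, not the lecture's.]

    import numpy as np

    def divisive_clusters(W, k):
        """Greedily split until k clusters remain (assumes k <= # of vertices)."""
        clusters = [np.arange(len(W))]
        while len(clusters) < k:
            # One simple policy: split the cluster with the most vertices.
            idx = max(range(len(clusters)), key=lambda c: len(clusters[c]))
            verts = clusters.pop(idx)
            y = spectral_partition(W[np.ix_(verts, verts)])
            clusters.append(verts[y > 0])
            clusters.append(verts[y < 0])
        return clusters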
The Normalized Cut
Set vertex i's mass M_ii = L_ii. [Sum of edge weights adjoining vertex i.] [That is how we define a "normalized cut", which turns out to be a good choice for many different applications.] Popular for image_segmentation. [Image segmentation is the problem of looking at a photograph and separating it into different objects. To do that, we define a graph on the pixels.] For pixels with location w_i, brightness b_i, use graph weights
    w_ij = exp(- |w_i - w_j|^2 / alpha  -  |b_i - b_j|^2 / beta),

or zero if |w_i - w_j| is large.
[We choose a distance threshold, typically in the range of 4 to 10 pixels; pixels farther apart than that aren't connected. alpha and beta are empirically chosen constants. It often makes sense to choose beta proportional to the variance of the brightness values.]
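[A sketch of these affinities in Python; `coords` (an n-by-2 array of pixel positions), `brightness`, and `radius` are assumed names. Per the note above, beta might be set proportional to np.var(brightness).]

    import numpy as np

    def pixel_affinities(coords, brightness, alpha, beta, radius):
        # Pairwise squared distances in position and in brightness.
        d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=2)
        b2 = (brightness[:, None] - brightness[None, :]) ** 2
        W = np.exp(-d2 / alpha - b2 / beta)
        W[d2 > radius ** 2] = 0.0      # pixels far apart aren't connected
        np.fill_diagonal(W, 0.0)       # no self-edges
        return W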
[Show segmentation of baseball image (baseballsegment.pdf). The upper left figure is a photo of a scene during a baseball game. The other figures show segments of the image extracted by recursive spectral partitioning.] [Show eigenvectors 2-9 from the baseball image (baseballvectors.pdf).]
Invented by [our own] Prof. Jitendra Malik and his student Jianbo Shi.