Limits on the recent progress on matrix multiplication algorithms (guest post by Yuval Filmus)

Virginia Vassilevska Williams stunned the world of computer science by presenting the first improvement to the matrix multiplication constant \omega in twenty years (it was later found that more modest progress had been obtained independently and slightly earlier by Andrew James Stothers). The current world record is held by François Le Gall, who showed that \omega < 2.3728639 by perfecting the methods of Vassilevska Williams. For an exposition of this line of work, check out my recent lecture notes. For the full details, I recommend Le Gall's paper and the journal version of Stothers' thesis (written with his advisor A. M. Davie).

The recent progress builds heavily on classic work by Coppersmith and Winograd. Briefly, they came up with a basic identity known as the Coppersmith–Winograd identity. Applying Strassen's laser method with their own ingenious construction (relying on Salem–Spencer sets), they obtained the bound \omega < 2.388. Applying the same method to the tensor square of the basic identity, they obtained the improved bound \omega < 2.376. That is where things stood from 1990 until 2010, when Stothers managed to perform the computations for the fourth tensor power, obtaining the bound \omega < 2.373. A year later (but independently), Vassilevska Williams analyzed the eighth tensor power and obtained the bound \omega < 2.3729. Further progress was made by Le Gall, who analyzed the sixteenth and thirty-second tensor powers, obtaining the current world record stated above.

Although taking ever higher tensor powers improves the bound on \omega, the improvements are diminishing, and it seems safe to conjecture that taking the Nth tensor power does not yield the conjectured \omega = 2 as N \to \infty. But how can we be certain? Perhaps the improvements merely slow down and, like the partial sums of a slowly diverging series, eventually go all the way down to 2?
In the rest of the blog post, we describe a recent result of Andris Ambainis and myself, which shows that for any value of N, the best bound this method can produce is no better than \omega < 2.3078. In fact, our result allows for a wider class of methods which utilize the Coppersmith–Winograd identity, and hints at a new technique which can potentially lead to better algorithms. Very recently, Le Gall was able to use our methods to show that the best bound that can be obtained by taking an Nth tensor power of the Coppersmith–Winograd identity is \omega < 2.3725, and so the current analysis of the identity is quite tight.

The rest of this post assumes that the reader is versed in the basics of the theory of matrix multiplication algorithms. Specifically, we expect you to know the statement of Schönhage's asymptotic sum inequality, which is described in my earlier blog post.

The laser method

Schönhage’s asymptotic sum inequality allows us to obtain an upper bound on \omega given an upper bound on the border rank of a disjoint sum of matrix multiplication tensors \langle n_i,m_i,p_i \rangle. Strassen’s laser method is a framework for achieving the same for non-disjoint sums of matrix multiplication tensors. It is best illustrated with the following example from the paper of Coppersmith and Winograd:

\displaystyle \epsilon^3 \sum_{i=1}^q (x_0^{[0]} y_i^{[1]} z_i^{[1]} + x_i^{[1]} y_0^{[0]} z_i^{[1]} + x_i^{[1]} y_i^{[1]} z_0^{[0]}) + O(\epsilon^4) =

\displaystyle \epsilon \sum_{i=1}^q (x_0^{[0]} + \epsilon x_i^{[1]}) (y_0^{[0]} + \epsilon y_i^{[1]}) (z_0^{[0]} + \epsilon z_i^{[1]}) -

\displaystyle \left(x_0^{[0]} + \epsilon^2 \sum_{i=1}^q x_i^{[1]}\right) \left(y_0^{[0]} + \epsilon^2 \sum_{i=1}^q y_i^{[1]}\right) \left(z_0^{[0]} + \epsilon^2 \sum_{i=1}^q z_i^{[1]}\right) +

(1-q\epsilon) x_0^{[0]} y_0^{[0]} z_0^{[0]}.

Here q is some integer parameter. The superscripts on the variables partition them into groups. For example, the two groups of x-variables are x_0^{[0]} and x_1^{[1]}, \ldots, x_q^{[1]}. We can write this identity succinctly as

\underline{R}(\langle 1,1,q \rangle^{0,1,1} + \langle q,1,1 \rangle^{1,0,1} + \langle 1,q,1 \rangle^{1,1,0}) \leq q+2.

The superscripts have the same meaning, for example the first constituent tensor \langle 1,1,q \rangle^{0,1,1} involves the x-variables from group 0, the y-variables from group 1, and the z-variables from group 1. Each constituent tensor uses all variables from each group, and this convention gives a unique interpretation of the short form of the identity.

Denote by T the tensor whose border rank is estimated above. This tensor is a sum of three matrix multiplication tensors, but we cannot apply the asymptotic sum inequality, since these constituent tensors are not disjoint: for example, the first two constituent tensors share the z-variables. Strassen's idea was to take a high tensor power of the original identity, and zero out variables so that the remaining constituent tensors are disjoint (Strassen actually considered a different operation, degeneration, instead of zeroing, but Coppersmith and Winograd preferred the simpler operation of zeroing). The Nth tensor power of T has 3^N constituent tensors, and border rank at most (q+2)^N. For example, when N=2, we obtain

T^{\otimes 2} = \langle 1,1,q^2 \rangle^{00,11,11} + \langle q,1,q \rangle^{01,10,11} + \langle 1,q,q \rangle^{01,11,10}

+ \langle q,1,q \rangle^{10,01,11} + \langle q^2,1,1 \rangle^{11,00,11} + \langle q,q,1 \rangle^{11,01,10}

+ \langle 1,q,q \rangle^{10,11,01} + \langle q,q,1 \rangle^{11,10,01} + \langle 1,q^2,1 \rangle^{11,11,00}.

The partition of the x-variables of T into two groups naturally corresponds to a partition of the x-variables of T^{\otimes 2} into four groups, indexed by 0/1 vectors of length 2. These indices are used above to form the index triples appearing as superscripts. In fact, the dimensions of any of the constituent matrix multiplication tensors can be recovered from its index triple, so we can “summarize” T^{\otimes 2} by giving a list of all nine index triples:

(00,11,11) \; (01,10,11) \; (01,11,10) \; (10,01,11) \; (11,00,11) \; (11,01,10) \; (10,11,01) \; (11,10,01) \; (11,11,00).
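As a quick sanity check, these nine index triples can be generated mechanically from the three basic triples. The following small Python sketch is purely illustrative (it is not part of any of the cited analyses):

```python
from itertools import product

# The three basic index triples of T.
basic = [("0", "1", "1"), ("1", "0", "1"), ("1", "1", "0")]

# Index triples of the tensor square: concatenate indices coordinate-wise.
triples = [(x1 + x2, y1 + y2, z1 + z2)
           for (x1, y1, z1), (x2, y2, z2) in product(basic, repeat=2)]

assert len(triples) == 3 ** 2            # 3^N triples for N = 2
assert ("00", "11", "11") in triples     # corresponds to <1,1,q^2>
assert ("11", "01", "10") in triples     # corresponds to <q,q,1>
```

The same comprehension with `repeat=N` generates the 3^N index triples of the Nth power.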

When taking the Nth tensor power, there will be 3^N index triples, each consisting of three vectors of length N. Our task is to zero out some groups of variables so that the surviving index triples correspond to disjoint matrix multiplication tensors, and as many of these as possible. Put differently, if we denote by L_N the list of index triples and by X,Y,Z the remaining variables, then we want L_N \cap (X \times Y \times Z) to consist of index triples in which no x-index repeats, no y-index repeats, and no z-index repeats. If we manage to accomplish this, leaving C_N surviving index triples, then the asymptotic sum inequality gives the bound

C_N (q^N)^{\omega/3} \leq (q+2)^N.

(Since each of the constituent tensors \langle n,m,p \rangle in T satisfies nmp = q, each constituent tensor \langle n',m',p' \rangle in T^{\otimes N} satisfies n'm'p' = q^N.)

Let \kappa_N be the maximum of C_N over all legal choices of X,Y,Z. It turns out that \kappa = \lim_{N\to\infty} \kappa_N^{1/N} = 3/2^{2/3}, and so the asymptotic sum inequality shows that (3/2^{2/3}) q^{\omega/3} \leq q+2. When q = 8, this gives the bound \omega < 2.404. We call the limit \kappa the capacity.
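Solving the inequality for \omega is a one-liner. The following sketch (my own illustration, not from any of the papers) recovers the bound \omega < 2.404 from \kappa = 3/2^{2/3} and q = 8:

```python
from math import log

# The asymptotic sum inequality gives kappa * q^(omega/3) <= q + 2,
# so omega <= 3 * log_q((q + 2) / kappa).
q = 8
kappa = 3 / 2 ** (2 / 3)          # the capacity, ~1.88988
omega_bound = 3 * log((q + 2) / kappa, q)
print(round(omega_bound, 3))      # 2.404
```

Trying other values of q confirms that q = 8 is the best choice for this identity.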

Upper bounds on the capacity

Coppersmith and Winograd used an ingenious construction to show that \kappa \geq 3/2^{2/3}, but they neglected to mention a matching upper bound showing that their construction is optimal. This upper bound, whose philosophy is essential to our approach, appears curiously enough in the paper by H. Cohn, R. Kleinberg, B. Szegedy and C. Umans on group-theoretic algorithms for matrix multiplication. Their basic idea is to use the fact that after zeroing out variables, no x-index repeats, and so \kappa_N is at most the number of x-indices. Since there are at most 2^N x-indices, this gives \kappa_N \leq 2^N and so \kappa \leq 2, a bound not as good as the one we had hoped for (3/2^{2/3} \approx 1.88988).

To improve on this trivial bound, we divide the surviving index triples into types. An index triple (I,J,K) has type (t_a,t_b,t_c) if it results from t_a copies of (0,1,1), t_b copies of (1,0,1) and t_c copies of (1,1,0). The number of distinct x-indices within a given type depends on the type; for example, when t_a=t_b=t_c=N/3, the number of distinct x-indices is \binom{N}{N/3} (such an x-index consists of t_a zeroes and t_b+t_c ones). It is not hard to check that for each choice of t_a,t_b,t_c, either the number of distinct x-indices is at most \binom{N}{N/3}, or the same is true for the number of distinct y-indices or the number of distinct z-indices. Since there are \binom{N+2}{2} different types, we obtain the bound

\kappa_N \leq \binom{N+2}{2} \binom{N}{N/3}.

Stirling's approximation then shows that \kappa \leq 3/2^{2/3}; note that the number of types \binom{N+2}{2} disappears when computing the limit \kappa = \lim_{N\to\infty} \kappa_N^{1/N}, since \binom{N+2}{2} is polynomial rather than exponential in N.
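One can watch the polynomial factor wash out numerically. This small sketch (my own illustration; the helper name `upper_bound_root` is mine) evaluates (\binom{N+2}{2}\binom{N}{N/3})^{1/N} for growing N:

```python
from math import comb, log2

# kappa_N <= C(N+2,2) * C(N, N/3); taking Nth roots, the polynomial
# factor C(N+2,2) vanishes, leaving 2^H(1/3) = 3/2^(2/3) ~ 1.88988.
def upper_bound_root(N):
    return 2 ** ((log2(comb(N + 2, 2)) + log2(comb(N, N // 3))) / N)

target = 3 / 2 ** (2 / 3)
print(upper_bound_root(300))    # ~1.94, still above the target
print(upper_bound_root(3000))   # ~1.90, approaching 3/2^(2/3)
```

The convergence is slow because the polynomial factor contributes an error of order (log N)/N in the exponent.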

The Coppersmith–Winograd identity

The Coppersmith–Winograd identity improves on the previous identity by computing three more products at no cost in the border rank:

\underline{R}(\langle 1,1,q \rangle^{0,1,1} + \langle q,1,1 \rangle^{1,0,1} + \langle 1,q,1 \rangle^{1,1,0} + \langle 1,1,1 \rangle^{0,0,2} + \langle 1,1,1 \rangle^{0,2,0} + \langle 1,1,1 \rangle^{2,0,0}) \leq q+2.

This identity is the fount and source of all improvements on \omega since 1989. The two groups of x-variables x_0^{[0]};x_1^{[1]},\ldots,x_q^{[1]} appearing in the previous identity are joined by a new group x_{q+1}^{[2]} consisting of a single variable. The curious reader can try to spell out the actual identity on her own, or she can look it up in my lecture notes. We denote the corresponding tensor by T_{CW}.

The laser method can be applied to this identity in much the same way as before. After taking the Nth tensor power, we are left with 6^N different constituent tensors \langle n_i,m_i,p_i \rangle, each represented by an index triple (I_i,J_i,K_i). If the index triple contains M factors of one of the forms (0,1,1), (1,0,1), (1,1,0), then n_im_ip_i = q^M. Our task is to zero out some variables so that the remaining constituent tensors \langle n_{i'},m_{i'},p_{i'} \rangle are disjoint, and the quantity C_N = \sum_{i'} (n_{i'} m_{i'} p_{i'})^{\omega/3} is maximized. Since we don't know \omega, we maximize instead \sum_{i'} (n_{i'} m_{i'} p_{i'})^{\rho/3} for some parameter \rho. If \kappa_{\rho,N} is the maximum of this quantity for given \rho, then the asymptotic sum inequality shows that \kappa_{\omega,N} \leq (q+2)^N.

It turns out that for each \rho, the limit \kappa_\rho = \lim_{N\to\infty} \kappa_{\rho,N}^{1/N} exists, and the asymptotic sum inequality shows that \kappa_\omega \leq q+2. This reduces the analysis of T_{CW} to the determination of its associated capacity \kappa_\rho. Coppersmith and Winograd show that

\displaystyle \kappa_\rho \geq 2^{H(\tfrac{2-\alpha}{3},\tfrac{2\alpha}{3},\tfrac{1-\alpha}{3})} q^{\alpha\rho/3}, \text{ where } \alpha = \frac{-3+\sqrt{32q^{-\rho}+1}}{8q^{-\rho}-2},

where H is the base 2 entropy function. Amazingly, the method of Cohn, Kleinberg, Szegedy and Umans shows that this lower bound is tight! When q = 6 and \rho = 2.388, one checks that \kappa_\rho > q+2, and since \kappa_\omega \leq q+2 and \kappa_\omega is increasing in \omega, this shows that \omega < 2.388.
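This numerical check is easy to reproduce. Assuming I have transcribed the capacity formula above correctly, the following sketch evaluates it at q = 6 and \rho = 2.388:

```python
from math import log2, sqrt

# Coppersmith-Winograd's capacity formula for T_CW (as stated above),
# evaluated at q = 6 and rho = 2.388; kappa_rho > q + 2 gives omega < 2.388.
def H(*ps):  # base-2 entropy function
    return -sum(p * log2(p) for p in ps if p > 0)

q, rho = 6, 2.388
t = q ** (-rho)
alpha = (-3 + sqrt(32 * t + 1)) / (8 * t - 2)
kappa = 2 ** H((2 - alpha) / 3, 2 * alpha / 3, (1 - alpha) / 3) * q ** (alpha * rho / 3)
print(kappa)  # just above 8 = q + 2, so the bound omega < 2.388 follows
```

Decreasing \rho slightly pushes \kappa_\rho below q+2, so 2.388 is essentially the best exponent this identity yields without taking tensor powers.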

Powers of the Coppersmith–Winograd identity

Coppersmith and Winograd managed to get an even better bound on \omega by considering the tensor square T_{CW}^{\otimes 2} whose border rank is at most (q+2)^2. If we retain the old partition into groups (so there are now nine groups of each of the x-variables, y-variables and z-variables), then a quick consideration of the laser method reveals that we should obtain the exact same bound on \omega. Instead, Coppersmith and Winograd merge the original nine groups into five groups by putting the original group i_1i_2 into the new group i_1+i_2. This corresponds to the following partition of the original groups:

0 = \{00\} ;\; 1=\{01,10\} ;\; 2=\{02,11,20\} ;\; 3=\{12,21\} ;\; 4=\{22\}.

The original 36 constituent tensors are now grouped into 15 constituent tensors:

  1. 3 terms of the form \langle 1,1,1 \rangle^{0,0,4}.
  2. 6 terms of the form \langle 1,1,2q \rangle^{0,1,3} resulting from the merger of \langle 1,1,q \rangle^{00,10,12} and \langle 1,1,q \rangle^{00,01,21} into a single matrix multiplication tensor.
  3. 3 terms of the form \langle 1,1,q^2+2 \rangle^{0,2,2} resulting from the merger of three original constituent tensors: \langle 1,1,q^2 \rangle^{00,11,11}, \langle 1,1,1 \rangle^{00,20,02} and \langle 1,1,1 \rangle^{00,02,20}.
  4. 3 terms whose index triple is of the form (1,1,2) that are not matrix multiplication tensors, one of which results from merging \langle 1,q,1 \rangle^{10,10,02}, \langle 1,q,1 \rangle^{01,01,20}, \langle q,1,q \rangle^{10,01,11} and \langle q,1,q \rangle^{01,10,11}.
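The count of 15 merged tensors, and the sizes of the merged sets, can be verified mechanically. The following sketch (my own illustration, not taken from any of the papers) enumerates the 36 products and groups them by the merged index triple i_1+i_2:

```python
from itertools import product
from collections import defaultdict

# The six basic index triples of T_CW.
basic = [(0, 1, 1), (1, 0, 1), (1, 1, 0), (0, 0, 2), (0, 2, 0), (2, 0, 0)]

# The original group i1 i2 is merged into group i1 + i2, so each of the
# 36 products in the tensor square lands in the triple of coordinate sums.
groups = defaultdict(list)
for (x1, y1, z1), (x2, y2, z2) in product(basic, repeat=2):
    groups[(x1 + x2, y1 + y2, z1 + z2)].append(((x1, y1, z1), (x2, y2, z2)))

assert len(groups) == 15               # 15 merged constituent tensors
assert len(groups[(0, 0, 4)]) == 1     # a single product, <1,1,1>
assert len(groups[(0, 1, 3)]) == 2     # a merger of two tensors, <1,1,2q>
assert len(groups[(0, 2, 2)]) == 3     # a merger of three, <1,1,q^2+2>
assert len(groups[(1, 1, 2)]) == 4     # not a matrix multiplication tensor
```

Note that every merged triple has coordinates summing to 4, since each basic triple's coordinates sum to 2.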

The idea is now to apply the same method, but we encounter a problem: there are three tensors which are not matrix multiplication tensors. However, each of these tensors is itself a sum of matrix multiplication tensors, and so can be analyzed using the same method. In this recursive fashion, Coppersmith and Winograd were able to analyze this second power, and obtained the bound \omega < 2.376. What Stothers and Vassilevska Williams did was to codify this recursive application of the laser method, and using this codification they were able to handle larger powers, using a computer for the calculations. Their method has one shortcoming: when analyzing higher powers, the capacity problems encountered are too hard to solve exactly. Instead, they employ costly numerical methods that produce a good lower bound on the relevant capacities. Le Gall came up with a more efficient method which produces slightly inferior bounds but is applicable to higher tensor powers.

A different viewpoint: the laser method with merging

Our main innovation is a different way of looking at the analysis of powers of T_{CW}. The recipe of the laser method consists of taking a high tensor power, zeroing groups of variables so that the remaining constituent tensors are disjoint, and applying the asymptotic sum inequality. We add an additional step: after zeroing groups of variables, we allow merging sets of constituent tensors, as long as each merged set forms a matrix multiplication tensor and the resulting merged constituent tensors are all disjoint (before the merging step, the surviving constituent tensors need not be disjoint). After the merging step, we apply the asymptotic sum inequality as before. We call this scheme the laser method with merging. We have already seen examples of the merging step in the analysis of T_{CW}^{\otimes 2}: for example, the two constituent tensors \langle 1,1,q \rangle^{00,10,12} and \langle 1,1,q \rangle^{00,01,21} are merged into the matrix multiplication tensor \langle 1,1,2q \rangle^{\{(00,10,12),(00,01,21)\}}.

Coppersmith and Winograd's analysis of T_{CW}^{\otimes 2} is an example of the laser method with merging applied to the original tensor T_{CW}. To see this, let us examine their analysis. They first merge some tensors in T_{CW}^{\otimes 2}, and are left with a sum of matrix multiplication tensors and three "complicated" tensors, which we denote by T_4,T_5,T_6, each of which is by itself a sum of matrix multiplication tensors. They proceed to take an Nth tensor power of T_{CW}^{\otimes 2}, and zero out some groups of variables (each group corresponds to several original groups) so that the resulting constituent tensors are disjoint. Each surviving constituent tensor contains a factor of the form (T_4 \otimes T_5 \otimes T_6)^{\otimes M} (the powers of T_4,T_5,T_6 are always the same in their construction), and by zeroing out groups of variables internal to this factor, they reduce it to a disjoint sum of matrix multiplication tensors. After this second step of zeroing, we are left with a disjoint sum of matrix multiplication tensors, to which the asymptotic sum inequality is finally applied.

Instead of first merging the constituent tensors of T_{CW}^{\otimes 2}, taking an Nth tensor power, and then zeroing variables in two steps, we can reverse the order of steps: starting with T_{CW}^{\otimes 2N}, we first zero out variables, and then do the appropriate merging. We end up with the exact same disjoint sum of matrix multiplication tensors as before. This hopefully convinces the reader that the analysis of Coppersmith and Winograd can be put in our framework. The analyses of higher powers by Stothers, Vassilevska Williams and Le Gall add more levels of recursion but are otherwise the same, so they also fall into our framework. If we could prove an upper bound on the corresponding notion of capacity, we would obtain a limit on the upper bounds on \omega provable using this framework.

Limits on the laser method with merging

Let \kappa_{\rho,N} be the maximum of \sum_i (n_im_ip_i)^{\rho/3} over all direct sums \bigoplus_i \langle n_i,m_i,p_i \rangle which can be obtained from T_{CW}^{\otimes N} by zeroing out groups of variables and merging the surviving constituent tensors in sets. The asymptotic sum inequality shows that \kappa_{\omega,N} \leq (q+2)^N. It turns out that the limit \kappa_\rho = \lim_{N\to\infty} \kappa_{\rho,N}^{1/N} exists (essentially due to the easy fact \kappa_{\rho,N_1 N_2} \geq \kappa_{\rho,N_1} \kappa_{\rho,N_2}), and so the bound obtained by the method is \kappa_\omega \leq q+2. Our goal in this section is to obtain an upper bound on the capacity \kappa_\rho, which corresponds to a lower bound on the solution of the equation \kappa_\rho = q+2. The first step is to understand which mergers are possible, that is, which sets of constituent tensors of T_{CW}^{\otimes N} can be merged to form a matrix multiplication tensor. Let's take a look at the mergers taking place in Coppersmith and Winograd's analysis of T_{CW}^{\otimes 2}:

(00,10,12) \; (00,01,21)

(00,11,11) \; (00,02,20) \; (00,20,02)

In both cases, the x-indices of the merged index triples consist entirely of zeroes. This is not the most general possible form of merging. Indeed, consider the following example for N=4:

(0012,1000,1210) \; (0021,1000,1201) \; (0012,0100,2110) \; (0021,0100,2101)

This results from the tensor product of (00,10,12) \; (00,01,21) and its rotation (12,00,10) \; (21,00,01). In this example, the first two positions are always zero in the x-indices, and the last two positions are always zero in the y-indices.
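The zero pattern in this example is easy to test programmatically. Here is a small checker (my own sketch; the helper name `merged_on_zeroes` is mine) verifying that at every position, at least one of the three index classes is identically zero:

```python
# Check that at every position, at least one of the three index classes
# (x, y or z) is identically zero across the whole set of triples.
def merged_on_zeroes(triples):
    xs, ys, zs = zip(*triples)
    return all(all(x[p] == "0" for x in xs)
               or all(y[p] == "0" for y in ys)
               or all(z[p] == "0" for z in zs)
               for p in range(len(xs[0])))

# The N = 4 example above: the x-indices vanish on the first two
# positions, and the y-indices on the last two.
example = [("0012", "1000", "1210"), ("0021", "1000", "1201"),
           ("0012", "0100", "2110"), ("0021", "0100", "2101")]
assert merged_on_zeroes(example)
```

Perturbing any of the zero coordinates makes the check fail, as expected.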

Generalizing, we say that a set of constituent tensors of T_{CW}^{\otimes N} is merged on zeroes if for each of the N positions, either all the x-indices are zero at the position, or all the y-indices are zero at the position, or all the z-indices are zero at the position. Using this nomenclature, we see that Coppersmith and Winograd’s analysis of T_{CW}^{\otimes 2} always results in sets of constituent tensors merged on zeroes. Looking closely at the works of Stothers, Vassilevska Williams and Le Gall, we see that the same is true for their analyses. Indeed, we managed to prove that this is always the case.

Lemma. If q > 1 then any set of constituent tensors of T_{CW}^{\otimes N} which merge to a matrix multiplication tensor is merged on zeroes.

The interested reader can look at the proof in our paper (Lemma 4.1 on page 11).

Given the lemma, our upper bound on the capacity broadly follows the ideas in the upper bound of Cohn, Kleinberg, Szegedy and Umans. Given N, consider a collection of disjoint tensors resulting from starting with T_{CW}^{\otimes N}, zeroing out some groups of variables, and merging some of the surviving tensors. We call each merged set of tensors a line. Each line is associated with a line type (t_x,t_y,t_z), where t_x is the number of positions at which the x-indices are always zero. Each constituent tensor inside each line has a tensor type, which records how many times each of the original six constituent tensors was used to form it. Since the number of types is polynomial in N, we can focus our attention on an arbitrary line type and tensor type.

Consider for simplicity the case where the line type is (N/3,N/3,N/3) and the tensor type is of the form (\alpha,\alpha,\alpha,\beta,\beta,\beta) (i.e., it gives the same weight to constituent tensors of similar shape). We can calculate the total number P of distinct x-indices, as well as the number Q of distinct x-indices which can appear within any single line; Q<P since inside a line some of the coordinates are fixed at zero. Each constituent tensor in each line has the same dimensions \langle M,M,M \rangle, where the value of M depends on the parameters \alpha,\beta. Convexity considerations show that it is advantageous for all lines to contain the same number R \leq Q of distinct x-, y- and z-indices. The total number of x-variables, y-variables and z-variables in each line is RM^2 each (since each x-index corresponds to M^2 x-variables), and so the line is isomorphic to \langle \sqrt{R}M,\sqrt{R}M,\sqrt{R}M \rangle. The total number of lines is at most P/R (since different lines have disjoint x-indices), and so the total contribution to \kappa_{\rho,N} is (P/R) (\sqrt{R}M)^\rho = P R^{\rho/2-1} M^\rho \leq PQ^{\rho/2-1} M^\rho.

We can similarly bound the contributions of other line types and tensor types, and through this we obtain an upper bound on \kappa_{\rho,N} and so on \kappa_\rho.

Theorem. For q=5 and T_{CW}, the solution to \kappa_\rho = q+2 satisfies \rho > 2.3078.

We obtain similar results for other values of q. Curiously, the results deteriorate as q gets smaller.

Le Gall used our technique to analyze the merged version of the tensor T_{CW}^{\otimes 16}. For this tensor, he obtained a significantly improved bound.

Theorem (Le Gall). For q=5, the merged version of T_{CW}^{\otimes 2^r}, analyzed recursively as in prior works, cannot yield a bound better than \omega < 2.3725, for any r.

His result encompasses the analyses of T_{CW}^{\otimes N} à la Stothers, Vassilevska Williams and himself for all values of N; no value of N can prove the bound \omega < 2.3725.

What next?

We have seen that the Coppersmith–Winograd identity cannot be used to prove \omega = 2, at least not using current techniques. Anybody who wishes to prove \omega = 2 should therefore either look for a different identity, or develop a new technique. The second avenue has been tried by Cohn and Umans, who suggested a technique based on groups. Cohn, Kleinberg, Szegedy and Umans showed how to match the bound \omega < 2.376 obtained by Coppersmith and Winograd using the group-theoretic approach; their construction heavily relies on the methods of Coppersmith and Winograd, and in particular they need to use Coppersmith and Winograd’s lower bound on the capacity, which is the difficult part of both proofs. It is fair to say that so far, the group-theoretic techniques have not yielded any better upper bounds on \omega. It thus seems that the most promising avenue for major improvement is finding better identities replacing the Coppersmith–Winograd identity, which has served us for 25 years. Perhaps computers can be enlisted for the search?

Until we find a new identity or a new technique, here is another suggestion for obtaining an improved upper bound on \omega, albeit not \omega = 2. Let \omega_W be the best bound that can be obtained by analysing T_{CW}^{\otimes W} according to the previous techniques, and let \omega_\infty = \lim_{W\to\infty} \omega_W. The bound \omega_W is obtained as the solution to \lim_{N\to\infty} \kappa_{\omega_W,W,N}^{1/N} = q+2, where \kappa_{\rho,W,N} is the restriction of \kappa_{\rho,N} in which each merged set of tensors is a Cartesian product of tensors of width W. The best bound \omega' obtained by the laser method with merging applied to T_{CW}, in contrast, is the solution to \lim_{N\to\infty} \kappa_{\omega',N,N}^{1/N} = q+2. It could be that

\lim_{N\to\infty} \kappa_{\rho,N,N}^{1/N} > \lim_{W\to\infty} \lim_{N\to\infty} \kappa_{\rho,W,N}^{1/N},

and in that case \omega' < \omega_\infty. In other words, it could be the case that allowing the merging width to grow with N results in a better bound on \omega. Unfortunately, so far we haven’t been able to obtain an upper bound on \omega along these lines.
