On the recent progress on matrix multiplication algorithms (guest post by Virginia Vassilevska Williams)


A central question in the theory of algorithms is to determine the constant \omega, called the exponent of matrix multiplication. This constant is defined as the infimum of all real numbers such that for all \varepsilon>0 there is an algorithm for multiplying n\times n matrices running in time O(n^{\omega+\varepsilon}). Until the late 1960s it was believed that \omega=3, i.e. that no improvement can be found for the problem. In 1969, Strassen surprised everyone by showing that two n\times n matrices can be multiplied in O(n^{2.81}) time. This discovery spawned a twenty-year-long extremely productive time in which the upper bound on \omega was gradually lowered to 2.376. After a twenty-year stall, some very recent research has brought the upper bound down to 2.373.

Bilinear algorithms and recursion.
Strassen’s approach was to exploit the inherent recursive nature of matrix multiplication: the product of two kn\times kn matrices can be viewed as the product of two k\times k matrices, the entries of which are n\times n matrices. Suppose that we have an algorithm ALG that runs in o(k^3) time and multiplies two k\times k matrices. Then one can envision obtaining a fast recursive algorithm for multiplying k^i\times k^i matrices (for any integer i>1) as well: view the k^i\times k^i matrices as k\times k matrices the entries of which are k^{i-1}\times k^{i-1} matrices; then multiply the k\times k matrices using ALG and when ALG requires us to multiply two matrix entries, recurse.

This approach only works, provided that the operations that ALG performs on the matrix entries make sense as matrix operations: e.g. entry multiplication, taking linear combination of entries etc. One very general type of such algorithm is the so called bilinear algorithm: Given two matrices A and B, compute r products

P_l = (\sum_{i,j} u_{ijl} A[i,j])(\sum_{ij} v_{ijl} B[i,j]),

i.e. take r possibly different linear combinations of entries of A and multiply each one with a possibly different linear combination of entries of B. Then, compute each entry of the product AB as a linear combination of the P_l: AB[i,j]=\sum_{l} w_{ijl} P_l.

Given a bilinear algorithm ALG for multiplying two k\times k matrices (for constant k) that computes r products P_l, the recursive approach that multiplies k^i\times k^i matrices using ALG gives a bound \omega\leq \log_k r. To see this, notice that the number of additions that one has to do is no more than 3rk^2: at most 2k^2 to compute the linear combinations for each P_l and at most r for each of the k^2 outputs AB[i,j]. Since matrix addition takes linear time in the matrix size, we have a recurrence of the form T(k^i) = r T(k^{i-1}) + O(rk^{2i}).

As long as r<k^3 we get a nontrivial bound on \omega. Strassen’s famous algorithm used k=2 and r=7 thus showing that \omega\leq \log_2 7. A lot of work went into getting better and better “base algorithms” for varying constants k. Methods such as Pan’s method of trilinear aggregation were developed. This approach culminated in Pan’s algorithm (1978) for multiplying 70\times 70 matrices that used 143,640 products and hence showed that \omega\leq \log_{70} 143,640<2.796.

Approximate algorithms and Schonhage’s theorem.
A further step was to look at more general algorithms, so called approximate bilinear algorithms. In the definition of a bilinear algorithm the coefficients u_{ijl},v_{ijl},w_{ijl} were constants. In an approximate algorithm, these coefficients can be formal linear combinations of integer powers of an indeterminate, \lambda (e.g. 2\lambda^2-5\lambda^{-3}). The entries of the product AB are then only “approximately” computed, in the sense that AB[i,j]=\sum_{l} w_{ijl} P_l + O(\lambda), where the O(\lambda) term is a linear combination of positive powers of \lambda. The term “approximate” comes from the intuition that if you set \lambda to be close to 0, then the algorithm would get the product almost exactly.

Interestingly enough, Bini et al. (1980) showed that when dealing with the asymptotic complexity of matrix multiplication, approximate algorithms suffice for obtaining bounds on \omega. This is not obvious! What Bini et al. show, in a sense, is that as the size of the matrices grows, the “approximation” part can be replaced by a sort of bookkeeping which does not present an overhead asymptotically. The upshot is that if there is an approximate bilinear algorithm that computes r products P_l to compute the product of two k\times k matrices, then \omega\leq \log_k r.

Bini et al. (1979) gave the first approximate bilinear algorithm for a matrix product. Their algorithm used 10 entry products to multiply a 2\times 3 matrix with a 3\times 3 matrix. Although this algorithm is for rectangular matrices, it can easily be converted into one for square matrices: a 12\times 12 matrix is a 2\times 3 matrix with entries that are 3\times 2 matrices with entries that are 2\times 2 matrices, and so multiplying 12\times 12 matrices can be done recursively using Bini et al.’s algorithm three times, taking 1,000 entry products. Hence \omega\leq \log_{12} 1,000<2.78.

Schonhage (1981) developed a sophisticated theory involving the bilinear complexity of rectangular matrix multiplication that showed that approximate bilinear algorithms are even more powerful. His paper culminated in something called the Schonhage \tau-theorem, or the asymptotic sum inequality. This theorem is one of the most useful tools in designing and analyzing matrix multiplication algorithms.

Schonhage’s \tau-theorem says roughly the following. Suppose we have several instances of matrix multiplication, each involving matrices of possibly different dimensions, and we are somehow able to design an approximate bilinear algorithm that solves all instances and uses fewer products than would be needed when computing each instance separately. Then this bilinear algorithm can be used to multiply (larger) square matrices and would imply a nontrivial bound on \omega.

What is interesting about Schonhage’s theorem is that it is believed that when it comes to exact bilinear algorithms, one cannot use fewer products to compute several instances than one would use by just computing each instance separately. This is known as Strassen’s additivity conjecture. Schonhage showed that the additivity conjecture is false for approximate bilinear algorithms. In particular, he showed that one can approximately compute the product of a 3\times 1 by a 1\times 3 vector and the product of a 1\times 4 by a 4\times 1 vector together using only 10 entry products, whereas any exact bilinear algorithm would need at least 3\cdot 3+4 = 13 products. His theorem then implied \omega<2.55, and this was a huge improvement over the previous bound of Bini et al.

Using fast solutions for problems that are not matrix multiplications.
The next realization was that there is no immediate reason why the “base algorithm” that we use for our recursion has to compute a matrix product at all. Let us focus on the following family of computational problems. We are given two vectors x and y and we want to compute a third vector z. The dependence of z on x and y is given by a three-dimensional tensor t as follows: z_k = \sum_{ij} t_{ijk} x_i y_j. The vector z is a bilinear form. The tensor t can be arbitrary, but let us focus on the case where t_{ijk}\in \{0,1\}. Notice that t completely determines the computational problem. Some examples of such bilinear problems are polynomial multiplication and of course matrix multiplication. For polynomial multiplication, t_{ijk} = 1 if and only if j=k-i, and for matrix multiplication, t_{(i,i'),(j,j'),(k,k')}=1 if and only if i'=j, j'=k and k'=i.

The nice thing about these bilinear problems is that one can easily extend the theory of bilinear algorithms to them. A bilinear algorithm computing a problem instance for tensor t computes r products P_l of the form (\sum_{i} u_{il} x_i)(\sum_j v_{jl} y_j) and then sets z_k=\sum_k w_{kl} P_l. Here, an algorithm is nontrivial if the number of products r that it computes is less than the number of positions t_{ijk} where the tensor t is nonzero.

In order to be able to talk about recursion for general bilinear problems, it is useful to define the tensor product t\otimes t' of two tensors t and t': (t\otimes t')_{(i,i'),(j,j'),(k,k')} = t_{ijk}\cdot t_{i'j'k'}. Thus, the bilinear problem defined by t\otimes t' can be viewed as a bilinear problem z_k = \sum_{ij} t_{ijk} x_i y_j defined by t, where each product x_iy_j is actually itself a bilinear problem z_{ijk'}=\sum_{i'j'} t'_{i'j'k'} x_{ii'}y_{jj'} defined by t'.

This allows one to compute an instance of the problem defined by t\otimes t' using an algorithm for t and an algorithm for t'. One can similarly define the k^{th} tensor power of a tensor t as tensor-multiplying t by itself k times. Then any bilinear algorithm computing an instance defined by t using r entry products can be used recursively to compute the k^{th} tensor power of t using r^k products, just as in the case of matrix multiplication.

A crucial development in the study of matrix multiplication algorithms was the discovery that sometimes algorithms for bilinear problems that do not look at all like matrix products can be converted into matrix multiplication algorithms. This was first shown by Strassen in the development of his “laser method” and was later exploited in the work of Coppersmith and Winograd (1987,1990). The basic idea of the approach is as follows.

Consider a bilinear problem P for which you have a really nice approximate algorithm ALG that uses r entry products. Take the n^{th} tensor power P^n of P (for large n), and use ALG recursively to compute P^n using r^n entry products. P^n is a bilinear problem that computes a long vector z from two long vectors x and y. Suppose that we can embed the product C of two N\times N matrices A and B into P^n as follows: we put each entry of A into some position of x and set all other positions of x to 0, we similarly put each entry of B into some position of y and set all other positions of y to 0, and finally we argue that each entry of the product C is in some position of the computed vector z (all other z entries are 0). Then we would have a bilinear algorithm for computing the product of two N\times N matrices using r^n entry products, and hence \omega\leq \log_N r^n.

The goal is to make N as large of a function of n as possible, thus minimizing the upper bound on \omega.

Strassen’s laser method and Coppersmith and Winograd’s paper, and even Schonhage’s \tau-theorem, present ways of embedding a matrix product into a large tensor power of a different bilinear problem. The approaches differ in the starting algorithm and in the final matrix product embedding. We’ll give a very brief overview of the Coppersmith-Winograd algorithm.

The Coppersmith-Winograd algorithm.
The bilinear problem that Coppersmith and Winograd start with is as follows. Let q\geq 3 be an integer. Then we are given two vectors x and y of length q+2 and we want to compute a vector z of length q+2 defined as follows:

z_0 = \sum_{i=1}^q (x_iy_i+x_0y_{q+1})+x_{q+1}y_0,

z_i=x_iy_0+x_0y_i for i\in \{1,\ldots, q\}, and z_{q+1}=x_0y_0.

Notice that z is far from being a matrix product. However, it is related to 6 matrix products:

\sum_{i=1}^q x_iy_i which is the inner product of two q-length vectors,

x_0y_{q+1}, x_{q+1}y_0, and x_0y_0, which are three scalar products, and

the two matrix products computing x_iy_0 and x_0y_i for i\in \{1,\ldots, q\} which are both products of a vector with a scalar.

If we could somehow convert Coppersmith and Winograd’s bilinear problem into one of computing these 6 products as independent instances, then we would be able to use Schonhage’s \tau-theorem. Unfortunately, however, the 6 matrix products are merged in a strange way, and it is unclear whether one can get anything meaningful out of an algorithm that solves this bilinear problem.

Coppersmith and Winograd develop a multitude of techniques to show that when one takes a large tensor power of the starting bilinear problem, one can actually decouple these merged matrix products, and one can indeed apply the \tau-theorem. The \tau-theorem then gives the final embedding of a large matrix product into a tensor power of the original construction, and hence defines a matrix multiplication algorithm.

Their approach combines several impressive ingredients: sets avoiding 3-term arithmetic progressions, hashing and the probabilistic method. The algorithm computing their base bilinear problem is also impressive. The number of entry products it computes is q+2, which is exactly the length of the output vector z! That is, their starting algorithm is optimal for the particular problem that they are solving.

What is not optimal, however, is the analysis of how good of a matrix product algorithm one can obtain from the base algorithm. Coppersmith and Winograd noticed this themselves: They first applied their analysis to the original bilinear algorithm and obtained an embedding of an f(n)\times f(n) matrix product into the 2n-tensor power of the bilinear problem for some explicit function f. (Then they took n to go to infinity and obtained \omega< 2.388.) Then they noticed, that if one applies the analysis to the second tensor power of the original construction, then one obtains an embedding of an f'(n)\times f'(n) matrix product into the same 2n-tensor power, where f'(n)>f(n). That is, although one is considering embeddings into the same (2n) tensor power of the construction, the analysis crucially depends on which tensor power of the construction you start from! This led to the longstanding bound \omega<2.376. Coppersmith and Winograd left as an open problem what bound one can get if one starts from the third or larger tensor powers.

The recent improvements on \omega.
It seems that many researchers attempted to apply the analysis to the third tensor power of the construction, but this somehow did not improve bound on \omega. Because of this and since each new analysis required a lot of work, the approach was abandoned, at least until 2010. In 2010, Andrew Stothers  carried through the analysis on the fourth tensor power and discovered that the bound on \omega can be improved to 2.374.

As mentioned earlier, the original Coppersmith-Winograd bilinear problem was related to 6 different matrix multiplication problems that were merged together. The k^{th} tensor power of the bilinear problem is similarly composed of poly(k) merged instances of simpler bilinear ptoblems, however these instances may no longer be matrix multiplications. When applying a Coppersmith-Winograd-like analysis to the k^{th} tensor power, there are two main steps.

The first step involves analyzing each of the poly(k) bilinear problems, intuitively in terms of how close they are to matrix products; there is a formal definition of the similarity measure called the value of the bilinear form. The second step defines a family of matrix product embeddings in the n^{th} tensor power in terms of the values. These embeddings are defined via some variables and constraints, and each represents some matrix multiplication algorithm. Finally, one solves a nonlinear optimization program to find the best among these embeddings, essentially finding the best matrix multiplication algorithm in the search space.

Both the Coppersmith-Winograd paper and Stothers’ thesis perform an entirely new analysis for each new tensor power k. The main goal of my work was to provide a general framework so that the two steps of the analysis do not have to be redone for each new tensor power k. My paper first shows that the first step, the analysis of each of the values, can be completely automated by solving linear programs and simple systems of linear equations. This means that instead of proving poly(k) theorems one only needs to solve poly(k) linear programs and linear systems, a simpler task. My paper then shows that the second step of the analysis, the theorem defining the search space of algorithms, can also be replaced by just solving a simple system of linear equations. (Amusingly, the fact that matrix multiplication algorithms can be used to solve linear equations implies that good matrix multiplication algorithms can be used to search for better matrix multiplication algorithms.) Together with the final nonlinear program, this presents a fully automated approach to performing a Coppersmith-Winograd-like analysis.

After seeing Stothers’ thesis in the summer of last year, I was impressed by a shortcut he had used in the analysis of the values of the fourth tensor power. This shortcut gave a way to use recursion in the analysis, and I was able to incorporate it in my analysis to show that the number of linear programs and linear systems one would need to solve to compute the values for the k^{th} tensor power drops down to O(k^2), at least when k is a power of 2. This drop in complexity allowed me to analyze the 8^{th} tensor power, thus obtaining an improvement in the bound of \omega: \omega<2.3727.

There are several lingering open questions. The most natural one is, how does the bound on \omega change when applying the analysis to higher and higher tensor powers. I am currently working together with a Stanford undergraduate on this problem: we’ll apply the automated approach to several consecutive powers, hoping to uncover a pattern so that one can then mathematically analyze what bounds on \omega can be proven with this approach.

A second open question is more far reaching: the Coppersmith-Winograd analysis is not optimal– in a sense it computes an approximation to the best embedding of a matrix product in the n^{th} tensor power of their bilinear problem. What is the optimal embedding? Can one analyze it mathematically? Can one automate the search for it?

Finally, I am fascinated by automating the search for algorithms for problems. In the special case of matrix multiplication we were able to define a search space of algorithms and then use software to optimize over this search space. What other computational problems can one approach in this way?


One Response to “On the recent progress on matrix multiplication algorithms (guest post by Virginia Vassilevska Williams)”

  1. Yuval Says:

    The Salem-Spencer sets (those avoiding 3-term arithmetic progressions) can be avoided: see Section 15.7 of Algebraic Complexity Theory (Bürgisser et al.), which contains an alternative construction due to Strassen.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s

%d bloggers like this: