A general proof system in the sense of Cook and Reckhow (1979) can be understood as a nondeterministic guess-and-verify algorithm. The question whether there exist optimal proof systems essentially asks whether there exists a best such verification procedure. For practical purposes, such an optimal proof system would be extremely useful, as both the search for good verification algorithms and the quest for lower bounds on proof size could concentrate on the optimal system. Thus the following question is of great significance: Do there exist optimal proof systems for a given language?

Formally, a proof system for a language L is optimal if for any proof w of an instance x in any proof system for L there exists a proof w′ of x in the optimal system that is at most polynomially longer than w. If this transformation of w into w′ can even be computed efficiently, then the system is called p-optimal. Currently, it is only known that all languages in NP have optimal proof systems and all languages in P even admit p-optimal proof systems. However, it is open whether there exist languages outside NP with optimal proof systems or outside P with p-optimal proof systems.

Of central interest is the question whether there exists an optimal proof system for the coNP-complete set of classical propositional tautologies (TAUT). This question was posed by Krajíček and Pudlák (1989). Understanding the question better through characterizations is an important line of research with connections to a number of different topics. The first result in this area is due to Krajíček and Pudlák (1989) who showed the equivalence between the existence of p-optimal proof systems for propositional tautologies and the existence of optimal acceptors (algorithms that are optimal on the positive instances) for this problem. This equivalence was generalized to other problems by Sadowski (1999) and Messner (1999). Beyersdorff, Köbler, and Messner (2009) showed that optimality implies p-optimality for any system and any language if and only if the natural proof system for SAT (where proofs are just satisfying assignments) is p-optimal; the existence of an optimal system would imply the existence of a p-optimal system if there is some p-optimal system for SAT.

Recently, Chen and Flum (2012) uncovered further surprising relations of optimal proof systems to descriptive complexity and parameterized complexity. The link between these fields is provided by studying listings, i.e., enumerations of machines that compute all easy subsets of intractable problems like TAUT. Through this link Chen and Flum relate optimal proof systems to the existence of bounded logics for complexity classes such as polynomial time as well as deterministic and nondeterministic logarithmic space.

There are also interesting connections to core questions of complexity theory. As already mentioned, an optimal system for propositional tautologies would allow us to reduce the NP vs coNP question to proving proof size bounds for just this optimal proof system. Optimal proof systems also imply the existence of complete problems for various promise complexity classes such as disjoint NP pairs (Razborov 1994, Pudlák 2003, Glaßer et al. 2005, Beyersdorff 2007), NP ∩ coNP (Sadowski 1997) and probabilistic classes such as BPP (Köbler et al. 2003). Further to these implications, Itsykson (2010) has shown the surprising result that the average-case version of BPP has a complete problem.

Computational complexity also provides sufficient conditions for the existence of (p-)optimal proof systems; however, these are not as widely believed as structural assumptions like NP ≠ coNP. Krajíček and Pudlák (1989) showed that the existence of optimal (resp., p-optimal) propositional proof systems is implied by NE=coNE (resp., E=NE), and Köbler, Messner, and Torán (2003) weakened these assumptions to double exponential time.

Other recent research concentrated on modified versions of the problem, where a number of surprising positive results have been obtained. Cook and Krajíček (2007) showed that (p-)optimal proof systems for tautologies exist if we allow just one bit of non-uniform advice. This result generalizes to arbitrary languages (Beyersdorff et al. 2011), but does not translate into optimal algorithms or acceptors. Hirsch et al. (2012) showed that optimal acceptors exist in a heuristic setting where we have a probability distribution on the complement of the language, but it is open whether this translates into optimal heuristic proof systems. Still another positive result was obtained by Pitassi and Santhanam (2010) who show that there exists an optimal quantified propositional proof system under a weak notion of simulation.

Another interesting relation is to weak first-order arithmetic theories, so-called bounded arithmetic. Propositional proof systems are known to enjoy a close relationship to suitable theories of bounded arithmetic, e.g. extended Frege systems EF correspond to the theory . This correspondence has two facets: 1) proofs of first-order statements in can be translated into polynomial-size EF proofs and 2) every proof system for which can prove the consistency is simulated by EF. Therefore, from the point of view of , there exists an optimal propositional proof system, namely EF. Likewise, from the point of view of , there exists a complete disjoint NP-pair (the canonical pair of EF) etc.

In general, the question whether optimal proof systems exist is wide open; however, most researchers seem to conjecture a negative answer. Confirming such a negative answer seems out of reach with current techniques as this would imply a separation of complexity classes. On the other hand, while a positive answer would have interesting consequences (optimal problems for promise classes), these would not be as dramatic as e.g. a collapse of the polynomial hierarchy and therefore would not seem to be in sharp conflict with beliefs of complexity theorists. Thus, we will probably not see the answer to the question soon; however, research on this topic will hopefully continue to uncover interesting connections between complexity and logic.

**The laser method**

Schönhage’s asymptotic sum inequality allows us to obtain an upper bound on the matrix multiplication exponent ω given an upper bound on the border rank of a disjoint sum of matrix multiplication tensors. Strassen’s laser method is a framework for achieving the same for *non-disjoint* sums of matrix multiplication tensors. It is best illustrated with the following example from the paper of Coppersmith and Winograd:

Here is some integer parameter. The superscripts on the variables partition them into groups. For example, the two groups of x-variables are and . We can write this identity succinctly as

The superscripts have the same meaning, for example the first constituent tensor involves the x-variables from group , the y-variables from group , and the z-variables from group . Each constituent tensor uses all variables from each group, and this convention gives a unique interpretation of the short form of the identity.

Denote the tensor whose border rank is estimated by . This tensor is a sum of three matrix multiplication tensors, but we cannot apply the asymptotic sum inequality since these constituent tensors are not disjoint: for example, the first two constituent tensors share the z-variables. Strassen’s idea was to take a high tensor power of the original identity, and zero out variables so that the remaining constituent tensors are disjoint (Strassen actually considered a different operation, *degeneration*, instead of zeroing, but Coppersmith and Winograd preferred to use the simpler operation of zeroing). The th tensor power of has constituent tensors, and border rank at most . For example, when , we obtain

The partition of the x-variables of into two groups naturally corresponds to a partition of the x-variables of into four groups, indexed by vectors of length . These indices are used above to form the *index triples* appearing as superscripts. In fact, the dimensions of any of the constituent matrix multiplication tensors can be recovered from its index triple, so we can “summarize” by giving a list of all nine index triples:

When taking the th tensor power, there will be index triples, each consisting of three vectors of length . Our task is to zero out some groups of variables so that the surviving index triples correspond to disjoint matrix multiplication tensors, and as many of these as possible. Put differently, if we denote by the list of index triples and by the *remaining* variables, then we want to consist of index triples in which no x-index repeats, no y-index repeats, and no z-index repeats. If we manage to accomplish this, generating surviving index triples, then the asymptotic sum inequality gives the bound

(Since each of the constituent tensors in satisfies , each constituent tensor in satisfies .)
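The zeroing task has a simple combinatorial core: keep a set of index triples in which no x-index, no y-index, and no z-index repeats. The following Python sketch illustrates the task on hypothetical index triples with a naive greedy rule (the actual constructions select the surviving triples far more cleverly):

```python
# Combinatorial core of the zeroing step: select index triples with
# pairwise distinct x-, y- and z-indices. The triples below are made-up
# examples; the greedy rule is only for illustration.

def greedy_disjoint(triples):
    """Greedily keep triples whose x-, y- and z-indices are all fresh."""
    used_x, used_y, used_z = set(), set(), set()
    kept = []
    for x, y, z in triples:
        if x not in used_x and y not in used_y and z not in used_z:
            kept.append((x, y, z))
            used_x.add(x); used_y.add(y); used_z.add(z)
    return kept

# Hypothetical index triples: vectors over {0,1} written as tuples.
triples = [
    ((0, 0), (1, 1), (1, 0)),
    ((0, 0), (0, 1), (0, 1)),  # shares its x-index with the first: dropped
    ((1, 1), (0, 0), (0, 1)),
    ((1, 0), (0, 1), (0, 0)),
]
kept = greedy_disjoint(triples)
```

After zeroing out all variable groups whose indices do not occur in `kept`, the surviving constituent tensors are disjoint.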

Let be the maximum of over all legal choices of . It turns out that , and so the asymptotic sum inequality shows that . When , this gives the bound . We call the limit the *capacity*.

**Upper bounds on the capacity**

Coppersmith and Winograd used an ingenious construction to show that , but they neglected to mention a matching upper bound showing that their construction is optimal. This upper bound, whose philosophy is essential to our approach, appears curiously enough in the paper by H. Cohn, R. Kleinberg, B. Szegedy and C. Umans on group-theoretic algorithms for matrix multiplication. Their basic idea is to use the fact that after zeroing out variables, no x-index repeats, and so is at most the number of x-indices. Since there are at most x-indices, this gives and so , a bound not as good as the one we had hoped for.

To improve on this trivial bound, we divide the surviving index triples into types. An index triple has type if it results from copies of , copies of and copies of . Within each given type, the number of distinct x-indices depends on the type. For example, when , the number of distinct x-indices is . It is not hard to check that for each choice of , either the number of distinct x-indices is at most , or the same is true for either the number of distinct y-indices or the number of distinct z-indices. Since there are different types, we obtain the bound

Stirling’s approximation then shows that ; note that the number of types disappears when computing the limit, since it is polynomial rather than exponential in the tensor power.
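The counting behind this step is easy to check numerically: a multinomial coefficient of type (aN, bN, cN) grows like 2 raised to H(a,b,c)·N, so its normalized logarithm converges to the entropy. A minimal Python sketch, using an illustrative type (1/2, 1/4, 1/4) rather than any value from the actual analysis:

```python
# Numeric check that C(N; aN, bN, cN) grows like 2^(H(a,b,c) * N):
# the normalized base-2 logarithm converges to the entropy as N grows.
import math

def log2_multinomial(parts):
    """log2 of N! / (k1! k2! ...), via lgamma for numerical stability."""
    n = sum(parts)
    return (math.lgamma(n + 1) - sum(math.lgamma(k + 1) for k in parts)) / math.log(2)

def entropy(ps):
    """Base-2 entropy H(p1, ..., pk)."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

N = 4000
dist = (0.5, 0.25, 0.25)            # illustrative type, not the optimal one
parts = [int(p * N) for p in dist]
rate = log2_multinomial(parts) / N  # close to H(1/2, 1/4, 1/4) = 1.5
```

The deviation between `rate` and the entropy is of order (log N)/N, which is why polynomial factors such as the number of types vanish in the limit.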

**The Coppersmith–Winograd identity**

The Coppersmith–Winograd identity improves on the previous identity by computing three more products at no cost in the border rank:

This identity is the fount and source of all improvements on ω since 1989. The two groups of x-variables appearing in the previous identity are joined by a new group consisting of a single variable. The curious reader can try to spell out the actual identity on her own, or she can look it up in my lecture notes. We denote the corresponding tensor by .

The laser method can be applied to this identity in much the same way as before. After taking the th tensor power, we are left with different constituent tensors , each represented by an index triple . If the index triple contains factors of one of the forms then . Our task is to zero out some variables so that the remaining constituent tensors are disjoint, and the quantity is maximized. Since we don’t know , we maximize instead for some parameter . If is the maximum of this quantity for given , then the asymptotic sum inequality shows that .

It turns out that for each , the limit exists, and the asymptotic sum inequality shows that . This reduces the analysis of to the determination of its associated capacity . Coppersmith and Winograd show that

where is the base 2 entropy function. Amazingly, the method of Cohn, Kleinberg, Szegedy and Umans shows that this lower bound is tight! When and , one checks that , and since and is increasing in , this shows that .

**Powers of the Coppersmith–Winograd identity**

Coppersmith and Winograd managed to get an even better bound on by considering the tensor square whose border rank is at most . If we retain the old partition into groups (so there are now nine groups of each of the x-variables, y-variables and z-variables), then a quick consideration of the laser method reveals that we should obtain the exact same bound on . Instead, Coppersmith and Winograd merge the original nine groups into five groups by putting the original group into the new group . This corresponds to the following partition of the original groups:

The original 36 constituent tensors are now grouped into 15 constituent tensors:

- 3 terms of the form .
- 6 terms of the form resulting from the merger of and into a single matrix multiplication tensor.
- 3 terms of the form resulting from the merger of three original constituent tensors: , and .
- 3 terms whose index triple is of the form that are not matrix multiplication tensors, one of which results from merging , , and .

The idea is now to apply the same method, but we encounter a problem: there are three tensors which are not matrix multiplication tensors. However, each of these tensors is itself a sum of matrix multiplication tensors, and so can be analyzed using the same method. In this recursive fashion, Coppersmith and Winograd were able to analyze this second power, and obtained the bound . What Stothers and Vassilevska Williams did was to codify this recursive application of the laser method, and using this codification they were able to handle larger powers, using a computer for the calculations. Their method has one shortcoming: when analyzing higher powers, the capacity problems encountered are too hard to solve exactly. Instead, they employ costly numerical methods that produce a good lower bound on the relevant capacities. Le Gall came up with a more efficient method which produces slightly inferior bounds but is applicable to higher tensor powers.

**A different viewpoint: the laser method with merging**

Our main innovation is a different way of looking at the analysis of powers of . The recipe of the laser method consists of taking a high tensor power, zeroing groups of variables so that the remaining constituent tensors are disjoint, and applying the asymptotic sum inequality. We add an additional step: after zeroing groups of variables, we allow merging sets of constituent tensors, as long as each merged set forms a matrix multiplication tensor, and the resulting merged constituent tensors are all disjoint (before the merging step, the surviving constituent tensors *don’t* need to be disjoint). After the merging step, we apply the asymptotic sum inequality as before. We call this scheme *the laser method with merging*. We already saw some examples of the merging step in the analysis of : for example, the two constituent tensors and are merged into the matrix multiplication tensor . Coppersmith and Winograd’s analysis of is an example of the laser method with merging *applied to the original tensor *. To see this, let us examine their analysis. They first merge some tensors in , and are left with a sum of matrix multiplication tensors and three “complicated” tensors, which we denote by , each of which is by itself a sum of matrix multiplication tensors. They proceed to take an th tensor power of , and zero some groups of variables (each group corresponds to several original groups) so that the resulting constituent tensors are disjoint. Each surviving constituent tensor contains a factor of the form (the powers of are always the same in their construction), and by zeroing out groups of variables internal to this factor, they reduce it to a disjoint sum of matrix multiplication tensors. After this second step of zeroing, we are left with a disjoint sum of matrix multiplication tensors, to which the asymptotic sum inequality is finally applied. 
Instead of first merging the constituent tensors of , taking an th tensor power, and then zeroing variables in two steps, we can reverse the order of steps. Starting with , we first zero out variables, and then do the appropriate merging. We end up with the exact same disjoint sum of matrix multiplication tensors as before. This hopefully convinces the reader that the analysis of Coppersmith and Winograd can be put in our framework. The analysis of higher powers by Stothers, Vassilevska Williams and Le Gall adds more levels of recursion but is otherwise the same, so these also fall into our framework. If we could prove an upper bound on the corresponding notion of capacity, we would obtain a limit on the upper bound on ω provable using this framework.

**Limits on the laser method with merging**

Let be the maximum of over all direct sums which can be obtained from by zeroing out groups of variables and merging the surviving constituent tensors in sets. The asymptotic sum inequality shows that . It turns out that the limit exists (essentially due to the easy fact ), and so the bound obtained by the method is . Our goal in this section is to obtain an upper bound on the capacity , which corresponds to a lower bound on the solution of the equation . The first step is to understand which mergers are possible — which sets of constituent tensors of can be merged to form a matrix multiplication tensor. Let’s take a look at the mergers taking place in Coppersmith and Winograd’s analysis of :

In both cases, the x-indices of the merged index triples consist entirely of zeroes. This is not the most general possible form of merging. Indeed, consider the following example for :

This results from the tensor product of and its rotation . In this example, the first two positions are always zero in the x-indices, and the last two positions are always zero in the y-indices.

Generalizing, we say that a set of constituent tensors of is *merged on zeroes* if for each of the positions, either all the x-indices are zero at the position, or all the y-indices are zero at the position, or all the z-indices are zero at the position. Using this nomenclature, we see that Coppersmith and Winograd’s analysis of always results in sets of constituent tensors merged on zeroes. Looking closely at the works of Stothers, Vassilevska Williams and Le Gall, we see that the same is true for their analyses. Indeed, we managed to prove that this is *always* the case.

**Lemma.** If then any set of constituent tensors of which merge to a matrix multiplication tensor is merged on zeroes.

The interested reader can look at the proof in our paper (Lemma 4.1 on page 11).

Given the lemma, our upper bound on the capacity broadly follows the ideas in the upper bound of Cohn, Kleinberg, Szegedy and Umans. Given , consider a collection of disjoint tensors resulting from starting with , zeroing out some groups of variables, and merging some of the surviving tensors. We call each merged set of tensors a *line*. Each line is associated with a *line type* , where is the number of positions at which the x-indices are always zero. Each constituent tensor inside each line has a *tensor type* which corresponds to how many of the original six constituent tensors were used to form it. Since the number of types is polynomial, we can focus our attention on an arbitrary line type and tensor type.

Consider for simplicity the case where the line type is and the tensor type is of the form (i.e., it gives the same weight to constituent tensors of similar shape). We can calculate the number of distinct x-indices and the number of distinct y-indices which can appear in any given line, since some of the coordinates are fixed at zero. Each constituent tensor in each line has the same dimensions , where the value of depends on the parameters . Convexity considerations show that it is advantageous for all lines to contain the same number of distinct x,y,z-indices. The total number of x-variables, y-variables and z-variables in each line is of each (since each x-index corresponds to x-variables), and so the line is isomorphic to . The total number of lines is at most (since different lines have disjoint x-indices), and so the total contribution to is .

We can similarly bound the contributions of other line types and tensor types, and through this we obtain an upper bound on the capacity, and so a limit on the bound on ω obtainable by the method.

**Theorem.** For and , the solution to satisfies .

We obtain similar results for other values of . Curiously, the results deteriorate as gets smaller.

Le Gall used our technique to analyze the merged version of the tensor . For this tensor, he obtained a significantly improved bound.

**Theorem (Le Gall).** For and the merged version of for any , analyzed recursively as in prior works, cannot yield a bound better than .

His result encompasses the analyses of à la Stothers, Vassilevska Williams and himself for *all* values of ; no value of can prove the bound .

**What next?**

We have seen that the Coppersmith–Winograd identity cannot be used to prove , at least not using current techniques. Anybody who wishes to prove should therefore either look for a different identity, or develop a new technique. The second avenue has been tried by Cohn and Umans, who suggested a technique based on groups. Cohn, Kleinberg, Szegedy and Umans showed how to match the bound obtained by Coppersmith and Winograd using the group-theoretic approach; their construction heavily relies on the methods of Coppersmith and Winograd, and in particular they need to use Coppersmith and Winograd’s lower bound on the capacity, which is the difficult part of both proofs. It is fair to say that so far, the group-theoretic techniques have not yielded any better upper bounds on ω. It thus seems that the most promising avenue for major improvement is finding better identities replacing the Coppersmith–Winograd identity, which has served us for 25 years. Perhaps computers can be enlisted for the search?

Until we find a new identity or a new technique, here is another suggestion for obtaining an improved upper bound on , albeit not . Let be the best bound that can be obtained by analysing according to the previous techniques, and let . The bound is obtained as the solution to , where is the restriction of in which each merged set of tensors is a Cartesian product of tensors of width . The best bound obtained by the laser method with merging applied to , in contrast, is the solution to . It could be that

and in that case . In other words, it could be the case that allowing the merging width to grow with results in a better bound on . Unfortunately, so far we haven’t been able to obtain an upper bound on along these lines.


In this blog post we describe a result due to Coppersmith and Winograd that implies that a certain class of techniques *provably* cannot yield an optimal exponent: namely, all algorithms which result from a single invocation of Schönhage’s *asymptotic sum inequality*. This class includes Strassen’s original algorithm and all subsequent algorithms up to Strassen’s *laser method*, used in the celebrated algorithm of Coppersmith and Winograd; the laser method corresponds to infinitely many invocations of the asymptotic sum inequality, and so is not subject to this limitation. The proof proceeds by showing how any identity (to which the asymptotic sum inequality can be applied) can be improved to another identity yielding a better bound on ω.

**Strassen’s original algorithm and the language of tensors**

In 1969, Strassen stunned the world of science by describing a matrix multiplication algorithm running in time O(n^2.81). The algorithm relies on an identity showing how to multiply two 2×2 matrices in non-commuting variables using 7 “essential” multiplications (that is, not including multiplication by constants). Running this algorithm recursively (by treating a 2n×2n matrix as a 2×2 matrix whose entries are n×n matrices) allows us to multiply two 2^k×2^k matrices using 7^k essential multiplications, and the resulting algorithm has complexity O(n^(log2 7)) = O(n^2.81).
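The recursion count is easy to verify directly. The following short sketch counts the essential multiplications used by the recursive scheme and recovers the exponent log2(7) ≈ 2.807:

```python
# Recursion count behind Strassen's algorithm: each halving step costs
# 7 recursive multiplications, so n = 2^k takes 7^k essential
# multiplications, giving exponent log_n(7^k) = log2(7).
import math

def essential_mults(n):
    """Essential multiplications for n x n matrices (n a power of 2)."""
    if n == 1:
        return 1
    return 7 * essential_mults(n // 2)

n = 2 ** 6
count = essential_mults(n)      # 7^6 = 117649
exponent = math.log(count, n)   # equals log2(7) ~ 2.8074
```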

Later on, Strassen and others developed a framework in which the basic identity can be expressed succinctly as

Generalizing the recursive construction gives Strassen’s theorem

It is now time to explain the notations and . The notation , or more generally , represents a certain *tensor*, a three-dimensional analog of a matrix:

The expression on the right-hand side represents a three-dimensional array whose rows are indexed by the elements , whose columns are indexed by the elements , and whose third dimension is indexed by the elements . Each term in the sum indexes a particular cell in this three-dimensional array, and represents the fact that this cell has value 1. All other cells have value 0. For comparison, a matrix could be encoded in a similar fashion by the double sum . The tensor represents the multiplication of an matrix (whose entries are indexed by the x-variables) by an matrix (whose entries are indexed by the y-variables), resulting in an matrix (whose entries are indexed by the z-variables).
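As a sanity check on this encoding, here is a small Python sketch (the representation as a list of index triples is my own choice, made for illustration): it builds the matrix multiplication tensor as the index triples ((i,j),(j,k),(k,i)) and verifies that substituting concrete matrices for the x- and y-variables reproduces the matrix product.

```python
# The matrix multiplication tensor <n, m, p> as a list of index triples
# ((i, j), (j, k), (k, i)). Contracting the x- and y-slots with matrices
# A (n x m) and B (m x p) accumulates the entries of the product A @ B.

def matmul_tensor(n, m, p):
    """Index triples of the tensor sum_{i,j,k} x_{ij} y_{jk} z_{ki}."""
    return [((i, j), (j, k), (k, i)) for i in range(n)
            for j in range(m) for k in range(p)]

def contract(tensor, A, B, n, p):
    """Substitute A for the x-variables and B for the y-variables."""
    C = [[0] * p for _ in range(n)]
    for (i, j), (_, k), _ in tensor:
        C[i][k] += A[i][j] * B[j][k]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
T = matmul_tensor(2, 2, 2)   # 2 * 2 * 2 = 8 nonzero cells
C = contract(T, A, B, 2, 2)  # equals the matrix product of A and B
```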

The notation stands for the *rank* of the tensor , which is the smallest number of rank one tensors which sum to . A rank one tensor is an outer product of the form

The usual matrix rank can also be defined in this fashion, and this is the easiest way to show that row rank equals column rank for matrices.

Here is Strassen’s identity in this language:

Tensorial notation makes it clear that , and the rank is the same for the other four permutations as well.

Two important operations on tensors are *direct sum* and *tensor product*. The direct sum of two tensors is a “block diagonal tensor” which has as one block and as the other block. In tensorial notation, after ensuring that variables appearing in both tensors are disjoint, . For example,

The variable names we chose are arbitrary; in fact, the equals sign really means “equal up to isomorphism”, or rather “equal up to choice of indices for rows, columns and the third dimension”. The two constituent tensors represent an outer product and an inner product of vectors, respectively, and will figure prominently in the sequel. It is not hard to check that in general, , but this is not always tight, as the example shows.

The tensor product of two tensors corresponds to the Kronecker product of matrices. Instead of explaining it formally, here is an example:

Formally speaking, the rows of the product tensor are indexed by pairs of row indices, and similarly for columns and the third dimension; the entry of the product tensor is equal to . It is not hard to check that .

The tensor product can be iterated to produce the *tensor power* . In terms of rank, .
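One consequence worth keeping in mind is that the tensor product of two matrix multiplication tensors is again a matrix multiplication tensor, with all three dimensions multiplying. This can be checked on the index-triple encoding from above (again an illustrative sketch, not code from any of the papers):

```python
# The tensor product pairs up indices in each of the three dimensions.
# For 0/1 tensors given as lists of index triples, the product's triples
# are all pairings; in particular <n1,m1,p1> (x) <n2,m2,p2> has
# n1*m1*p1 * n2*m2*p2 nonzero cells, matching <n1*n2, m1*m2, p1*p2>.

def matmul_tensor(n, m, p):
    return [((i, j), (j, k), (k, i)) for i in range(n)
            for j in range(m) for k in range(p)]

def tensor_product(S, T):
    """Pair up the x-, y- and z-indices of the two tensors."""
    return [((x1, x2), (y1, y2), (z1, z2))
            for (x1, y1, z1) in S for (x2, y2, z2) in T]

S = matmul_tensor(2, 1, 3)
T = matmul_tensor(1, 4, 1)
P = tensor_product(S, T)   # looks like <2, 4, 3>: 2 * 4 * 3 = 24 cells
```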

**Border rank and the asymptotic sum inequality**

The rank operator on matrices is continuous, that is, if a matrix has rank r, then all matrices close enough to it have rank at least r. Surprisingly, this is not the case for tensors: the identity

shows that for any , the rank of the tensor on the left is , and on the other hand one can check that when the rank is . The *border rank* of a tensor , denoted , is the minimum rank of a sequence of tensors converging to . Here is an example due to Schönhage:

In this identity, is some expression only involving powers of which are or higher. If we divide both sides by and let , we recover on the left-hand side. Since there are terms on the right-hand side, we conclude that

.

This is rather surprising, since clearly . At the cost of an increase of one in the border rank, we are somehow able to compute a large disjoint inner product!

It is a deep fact due to Strassen that *all* upper bounds on the border rank arise as “approximate identities”, which are identities like Schönhage’s identity mentioned above (the power of on the left-hand side can be different); all powers of appearing in such identities are positive integers. This shows that and for all tensors .
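The standard small example of the gap between rank and border rank is the tensor x1⊗y1⊗z2 + x1⊗y2⊗z1 + x2⊗y1⊗z1, which has rank 3 but border rank 2 (this may well be the example the identity above refers to; I present it here on my own initiative as a numeric illustration). The sketch below builds the rank-2 approximations and checks that they converge to the target tensor:

```python
# T = x1 y1 z2 + x1 y2 z1 + x2 y1 z1 has rank 3, yet
# T_eps = [ (x1+e*x2)(y1+e*y2)(z1+e*z2) - x1 y1 z1 ] / e
# is a sum of 2 rank-one tensors and converges to T as e -> 0.

def rank_one(u, v, w):
    """Outer product of three 2-vectors as a 2x2x2 nested list."""
    return [[[u[i] * v[j] * w[k] for k in range(2)]
             for j in range(2)] for i in range(2)]

def add(A, B, scale=1.0):
    return [[[A[i][j][k] + scale * B[i][j][k] for k in range(2)]
             for j in range(2)] for i in range(2)]

def max_dev(A, B):
    return max(abs(A[i][j][k] - B[i][j][k])
               for i in range(2) for j in range(2) for k in range(2))

# Target tensor T (index 0/1 stands for subscript 1/2).
T = add(add(rank_one([1, 0], [1, 0], [0, 1]),
            rank_one([1, 0], [0, 1], [1, 0])),
        rank_one([0, 1], [1, 0], [1, 0]))

def T_eps(eps):
    big = rank_one([1, eps], [1, eps], [1, eps])
    scaled = [[[c / eps for c in row] for row in plane] for plane in big]
    return add(scaled, rank_one([1, 0], [1, 0], [1, 0]), scale=-1.0 / eps)

devs = [max_dev(T_eps(e), T) for e in (0.1, 0.01, 0.001)]
```

The deviation is of order ε, so the rank-2 tensors T_eps converge to the rank-3 tensor T, witnessing border rank at most 2.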

How is Schönhage’s identity useful for multiplying two square matrices?

**Asymptotic sum inequality:** For a set of triples, not all equal to ,

This result of Schönhage is a vast generalization of Strassen’s theorem mentioned above, which corresponds to the special case in which there is only one triple (with border rank replaced by the rank). Unfortunately, explaining the proof of the asymptotic sum inequality will take us too far afield. The proof can be found in many sources, such as my lecture notes. One interesting aspect is that the proof of the asymptotic sum inequality does *not* result in an O(n^w) algorithm, where w is the upper bound on ω resulting from the inequality. Rather, the proof gives a sequence of algorithms whose exponents *converge* to w.

Choosing in Schönhage’s identity and applying the asymptotic sum inequality, we obtain

Solving for , we get the upper bound .
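Solving the asymptotic sum inequality numerically is routine: given the dimensions of the disjoint matrix multiplication tensors and a border rank bound R, bisection finds the exponent w satisfying the equation. Since the concrete values from Schönhage's identity are elided here, the sketch below checks the solver on Strassen's case instead (a single ⟨2,2,2⟩ tensor of rank 7, recovering w = log2(7)); other dimension lists and rank bounds can be plugged in the same way.

```python
# The asymptotic sum inequality turns a border rank bound R on a
# disjoint sum of tensors <n_i, m_i, p_i> into the bound on omega that
# solves  sum_i (n_i * m_i * p_i)^(w/3) = R.  The left side is
# increasing in w, so bisection applies.
import math

def omega_bound(dims, R, lo=2.0, hi=3.0, iters=100):
    """Solve sum (n*m*p)^(w/3) = R for w by bisection on [lo, hi]."""
    f = lambda w: sum((n * m * p) ** (w / 3) for (n, m, p) in dims) - R
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

w = omega_bound([(2, 2, 2)], 7)   # Strassen: log2(7) ~ 2.8074
```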

**The asymptotic sum inequality isn’t optimal**

Finally we are ready to state Coppersmith and Winograd’s result.

**Main theorem.** Let be the upper bound on obtained from an application of the asymptotic sum inequality. Then .

In other words, a single invocation of the asymptotic sum inequality cannot yield the optimal value of . One way around this limitation is to apply the asymptotic sum inequality to a sequence of identities — the course of action taken by Strassen’s laser method.

In the rest of this post, we provide an (almost) complete proof of this theorem. We start with a result generalizing Schönhage’s identity, which shows that in certain situations one can compute an additional inner product for free or at only a small cost in the border rank. This result is strong enough to prove a baby version of the main theorem.

**Baby theorem.** For all , .

This shows that no single identity à la Strassen’s original identity can yield the optimal value of .

Proving the main theorem itself requires more work, but the main ideas are contained in the proof of the baby theorem.

**Main construction and proof of baby theorem**

All the results of Coppersmith and Winograd rely on the following construction. We say that a tensor involves k *independent* x-variables if it involves k x-variables, and it cannot be written in terms of fewer than k linear combinations of the x-variables.

**Main construction.** Let be a tensor involving independent x-variables and independent y-variables, and suppose that

*, where .*

*Then , and furthermore this is witnessed by a decomposition of the form*

*where , the z-variable of the inner product, doesn’t appear in , and*

*.*

Before proving the main construction, let us see why it is useful.

**Corollary.** For all , putting we have

This corollary generalizes Schönhage’s identity: taking and using , we obtain .

**Proof of corollary.** Consider a decomposition . Choose vectors and such that is non-zero for all ; here is the outer product of and , and is the result of substituting the transpose of the matrix for the matrix variables : . Such vectors exist due to the minimality of the decomposition. Consider now the following decomposition of length :

Notice that

since we substituted . On the other hand, also

This shows that the premises of the main construction are satisfied, with . Since , the corollary follows directly from the main construction.

Using the corollary, the baby theorem follows quite easily.

**Proof of baby theorem.** The proof uses the standard inequality , which can be proved using the so-called *substitution method*. Let , so that . For each integer , . If for some then Strassen’s theorem shows that , so assume that for all . The corollary shows that

and the asymptotic sum inequality shows that

Since and so , for large enough we have , and so

which shows that .

**Proof of main construction**

Let us briefly recall the assumptions of the construction: is a tensor involving independent x-variables and independent y-variables, decomposes as a sum of rank one tensors , and .

For each y-index , let be the corresponding standard basis vector. Substituting for the y-variables, we obtain

We rephrase this in the language of matrices. Let be the matrix whose columns are , and let be the column vector whose entries are . The displayed equation is equivalent to . Since the x-variables are independent, has rank , and so the subspace of column vectors annihilating has dimension . The independence of the y-variables implies that the vectors are linearly independent, and so we can find vectors that together with form a basis for the subspace of column vectors annihilating .

We can apply the same argument the other way around. Let be the matrix whose rows are , and let be the row vector whose entries are . As before, , and the vectors are linearly independent. Furthermore, the subspace of row vectors annihilating has dimension , and so can be completed to a basis for this subspace by the addition of vectors .

Consider now the matrix whose rows are and the matrix whose columns are . The matrix has full row rank and the matrix has full column rank, and so the matrix has rank at least . On the other hand, the last rows of are simply the matrix , and the last columns of simply form the matrix . This shows that is a block diagonal matrix of the form , where is an square block. Since has rank at least , this block must be invertible. In fact, by choosing a different basis instead of , we can guarantee that

The next step is to lift to linear combinations of new x-variables and y-variables : for each , define

The linear combination has the same relation to the as has to the . The identity stated above for can be recast as

This allows us to add the additional inner product to the original identity. Let be a new z-variable, and consider the identity

This shows that . Furthermore, the coefficient of is exactly . This completes the proof of the main construction.

**Proof of main theorem**

The proof of the main theorem proceeds in three steps. In the first step we show how to absorb a tensor into another tensor satisfying the requirements of the main construction. In the second step we iterate the first step to obtain a result similar to the corollary used above to prove the baby theorem. Finally, in the third step we prove the main theorem itself, using a result akin to the lower bound .

We start with the first step.

**Auxiliary construction.** Suppose that is a tensor along with a decomposition satisfying the requirements of the main construction, and that is an arbitrary tensor (on different variables) whose border rank satisfies (notation as in the main construction). Then there is a decomposition of of length satisfying the requirements of the main construction.

**Proof.** Apply the main construction to to obtain a decomposition

Furthermore, let be a decomposition of . By adjusting one of the decompositions, we can assume without loss of generality that .

Denote by the substitution that sets for and otherwise, and define similarly. Applying these substitutions to the decomposition of above, we get

Moreover, the coefficient of in the first expression is exactly

We can rewrite the combined decomposition as

where the coefficient of vanishes. Removing , we obtain a decomposition satisfying the requirements of the main construction, completing the proof.

The next step is iterating the auxiliary construction, with an (almost) arbitrary starting point. Applying the main construction at the very end, we obtain the following corollary.

**Auxiliary corollary.** Suppose is a tensor involving independent x-variables and independent y-variables, whose border rank is at least . Then for every integer ,

In the statement of the corollary, is the direct sum of copies of .

**Proof.** Consider some decomposition , and extend it to the decomposition

of length , which satisfies the requirements of the main construction. By assumption , and so we can apply the auxiliary construction to obtain a decomposition of of length satisfying the requirements of the main construction with independent x-variables and independent y-variables. Once again, by assumption , and so we can apply the auxiliary construction again (to and ) to obtain a decomposition of of length satisfying the requirements of the main construction. In this way, for every we can obtain a decomposition of of length satisfying the requirements of the main construction with independent x-variables and independent y-variables. A final application of the main construction completes the proof.

Before we can prove the main theorem, we need one simple result.

**Lemma.** Suppose that applying the asymptotic sum inequality to the tensors yields the same upper bound on . If then applying the asymptotic sum inequality to shows that , and otherwise the same upper bound on is obtained. Similarly, if then applying the asymptotic sum inequality to shows that , and otherwise the same upper bound on is obtained.

*Proof.* Let have border rank and let have border rank . By assumption,

Applying the asymptotic sum inequality to the direct sum gives

If then this gives the bound , and otherwise a better bound is obtained (recall that ).

Applying the asymptotic sum inequality to the tensor product gives

If then this gives the bound , and otherwise a better bound is obtained (recall that ).

We can now complete the proof of the main theorem, using the above lemma freely.

**Proof of main theorem.** Let be an arbitrary tensor such that for some , denote the border rank of by , and let be the solution to the equation

Our goal is to show that .

Any tensor can be “rotated” by replacing the x,y,z-variables by y,z,x-variables. Any such permutation of variables results in a tensor with the same rank and border rank. In particular, the tensor

has border rank , and the same number of x-variables, y-variables and z-variables, say of each. If then the asymptotic sum inequality shows that , so we can assume that . A lower bound proved using the substitution method shows that (this is Theorem 6.1 in the paper of Coppersmith and Winograd). Pick a power so large that , and put . Note that has each of x-variables, y-variables and z-variables, and its border rank is at most . If the border rank is smaller, then the asymptotic sum inequality shows that , so we can assume that .

For brevity, put , , and . Also, let , where

The auxiliary corollary shows that for every , . Applying the asymptotic sum inequality gives

Choose so large that . Then

showing that . This completes the proof.

A word of warning before we start: The constructions below only work in some models of complexity. One of the models in which the results will not apply is the Turing Machine model. I will mention the model requirements as we go along but if you’d like a more detailed discussion of this topic, I refer you to Gurevich’s paper in the references.

Without further ado let’s jump straight into Levin’s neat little trick, performed by a combination of an interpreter and a program enumerator.

The program that we’ll design in this section takes an input *x* and runs infinitely, outputting an infinite sequence of values. Our program will output a new number in the sequence every *k* steps for some constant *k*. The sequence produced will turn out to be quite a wonderful characterization of *x* (if you love computational complexity). I’ll use the name *iseq(x)* for the infinite sequence generated on input *x*.

To design our program – let’s call it *iseqprog* – we’ll need two other programs to start from: A program enumerator and an interpreter for the enumerated programs.

The program enumerator, *progen*, takes as input a pair *(i,x)* and returns the initial configuration of the program with index *i* on input *x*. We’ll expect this operation to be constant-time when either *i=0* or we already called progen on *(i-1,x)*. In other words: *progen* is more like a method of an object (with internal state) which expects to be called with inputs *(0,x), (1,x), (2,x),…* and is able to process each item in such a call sequence in constant-time.

The interpreter we’ll need cannot be any old interpreter. In these modern times we can expect a certain service level. The interpreter should work like a slot machine in the arcades: Whenever I put in a new coin I continue my game with three more lives. In other words, when I give the interpreter the configuration of program *p* after *t* steps on input *x*, it returns an updated configuration representing the state of program *p* after *t+1* steps on input *x*. It also tells me if *p* terminated in step *t+1* and, if so, the return value of *p* on *x*. All of this happens in constant time. After all, the interpreter only needs to simulate one single step of *p* on *x*.

Comment: Almost any old interpreter *can* be used for Levin’s construction, but the exposition would become more complex.

Now I’ll describe the computation of *iseqprog* on input *x*. The computation proceeds in *rounds*, and each round consists of a number of *stages*. There is an infinite number of rounds. The number of stages in each round is finite but not constant across rounds.

Round 1 has only 1 stage. In this first stage of the first round, *iseqprog* runs *progen* on (0,x) and gets back the initial configuration of program 0 on input *x*. *iseqprog* then uses the interpreter to interpret just 1 step of program 0 on input *x*. If program 0 happens to terminate on input *x* in that first step, *iseqprog* immediately outputs program 0’s output on input *x*. Regardless of whether the interpretation of program 0 terminated, *iseqprog* itself does not terminate; it is on its way to generate an infinite sequence. If the interpretation of program 0 on input *x* did not terminate in its first step, *iseqprog* outputs the value 0 before continuing, providing us with a signal that it’s still live and running. This concludes the first (1-stage) round of *iseqprog*’s computation on *x*.

The computation continues in an infinite sequence of rounds. In each round, *iseqprog* calls *progen* once, adding one new item to the set of program configurations it has accumulated during the previous rounds. Each of these program configurations is interpreted for a number of steps. Every time the interpreter has executed one step of a program *i*, *iseqprog* outputs one value. The output value will be 0 or program *i*’s output on *x*. Whatever program you may think of, *iseqprog* will eventually generate its output on *x* (in between a lot of 0’s and many other values).

If we use this shorthand:

- “*+i*” means “create the initial configuration for program *i* on input *x*, then interpret 1 step on that configuration and output one value”
- “*i*” means “interpret 1 more step on the stored configuration for program *i* and output one value”

then we can sum up *iseqprog’s* first rounds like this:

Round 1: +1

Round 2: 11+2

Round 3: 111122+3

Round 4: 11111111222233+4

Round 5: 111111111111111122222222333344+5

Round 6: 32 1’s, 16 2’s, 8 3’s, 4 4’s, 2 5’s, and one +6

I hope it has become clear why *iseqprog* should be able to generate a new item of *iseq* every *k* steps or less for some constant *k*. Apart from the administrative work of looking up and saving configurations in some table, each step involves at most one call to the program enumerator and one call to the interpreter. These calls were assumed to be constant-time. The administrative work I will simply assume to be constant-time as well. *iseqprog* cannot work as intended in all complexity models; in particular, it doesn’t work for Turing machines.
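The schedule above can be modelled in a few lines of Python. This is a toy sketch under my own conventions, not the actual *iseqprog*: a “program” is a generator that yields 0 for every unfinished step and finally yields its output, and *progen(i, x)* returns such a generator.

```python
from itertools import count, islice

def iseq(x, progen):
    """Levin-style dovetailing: in round r, program i runs for
    2**(r-i) steps, so program i gets roughly a 2**(-i) fraction
    of all interpreter steps."""
    running = []
    for r in count(1):
        running.append(progen(r, x))        # "+r": start program r
        for i, g in enumerate(running, 1):
            for _ in range(2 ** (r - i)):   # budget for program i
                yield next(g, 0)            # finished programs yield 0

# toy enumerator: "program i" outputs x + i after 3*i steps
def toy_progen(i, x):
    for _ in range(3 * i):
        yield 0
    yield x + i

seq = list(islice(iseq(5, toy_progen), 50))
assert seq[4] == 6   # program 1's output (5 + 1) appears after a few steps
```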

Now let’s have a look at the sequence *iseq(x)* itself. The key observation is that although any individual program does not get much attention from *iseqprog*, it does get a specific percentage of attention that is not dependent on the input *x*. For instance, program 3 accounts for roughly 1/8 of the interpreter calls made by *iseqprog* regardless of the input *x*. The percentage is tied only to the program’s index number according to the program enumerator. From this observation we can derive (proof left to the reader) the salient feature of *iseq(x)*:

If program *p* outputs *y* on input *x* in time *t*, then *y* appears in *iseq(x)* at an index less than *ct* for a constant *c* depending only on *p*.

I think this is great! Whatever you want to compute from *x*, you’ll find it in *iseq(x)*. What’s more: Your answer appears quite early in the sequence – so early, in fact, that you might as well just run through *iseq(x)* rather than perform the computation itself! That’s why I decided to call *iseq(x)* a wonderful sequence.

It would be too good to be true… if it weren’t for two caveats. First, how do you recognize the value that you’re looking for? And second, what about that constant *c*? We’ll address these two questions below.

Comment: Another caveat is that the above doesn’t apply to all complexity models, in particular not to Turing Machines. For most of the common complexity models, I expect that the result will be true if you replace *ct* by *poly(t)* where *poly* is a polynomial depending only on *p*.

I’ll end this section with a simple specialization of the above that is too nice not to mention:

For any function *f* in P, there is a polynomial *p* such that *f(x)* appears in *iseq(x)* at an index less than *p(|x|)*.

And yes, *iseqprog* generates a new item of *iseq(x)* in *k* steps or less for some constant *k*!

So what good is all this if you cannot recognize the value you’re looking for? Luckily there are some situations where validating a correct answer is simpler than producing it – yes, I’m thinking about SAT. A satisfying assignment for a boolean formula can be validated in linear time. How can we exploit Levin’s idea to create an optimal solver for SAT?

The simplest answer is to modify the program enumerator. Our new program enumerator, call it *progenSAT*, wraps each program generated by the original program enumerator in a SAT validator. The computation of *progenSAT(i,x)* will proceed in two phases like this:

Phase 1: Run *progen(i,x)* and assign its output value to variable *y*.

Phase 2: If *y* is a satisfying assignment for the boolean formula *x* then output *y* else loop infinitely.
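In the same toy Python model as before (invented names, and the same generator convention), the wrapper might look like this; `check` is an assumed linear-time validator for CNF formulas given as lists of DIMACS-style integer clauses:

```python
def check(cnf, assignment):
    """Linear-time validator: cnf is a list of clauses (integer
    literals), assignment is a set of true literals."""
    return all(any(lit in assignment for lit in clause) for clause in cnf)

def progen_sat(progen):
    """Wrap each enumerated program in a SAT validator (a sketch of
    progenSAT; the generator protocol is my own convention)."""
    def wrapped(i, x):
        for step in progen(i, x):       # Phase 1: run program i on x
            if step == 0:
                yield 0                 # program i still running
                continue
            if check(x, step):          # Phase 2: validate candidate
                yield step
            while True:                 # otherwise "loop infinitely"
                yield 0
    return wrapped

# toy: "program i" proposes the assignment {i} after i steps
def toy_progen(i, x):
    for _ in range(i):
        yield 0
    yield {i}

g = progen_sat(toy_progen)(2, [[2, 3]])   # formula: single clause (x2 or x3)
out = [next(g) for _ in range(5)]
assert out == [0, 0, {2}, 0, 0]
```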

If we plug *progenSAT* into *iseqprog* we get a new program *iseqprogSAT* generating a new sequence *iseqSAT(x)* on input *x*.

Like the original *iseqprog*, our new program *iseqprogSAT* generates a new item every *k* steps or less for some constant *k*. I’m assuming that *progenSAT* also takes constant time to generate each new program configuration. Let us adapt the key observation about *iseq(x)* to the sequence *iseqSAT(x)* (once again, I’ll leave the proof to the reader):

If program *p* outputs *y* on input *x* in time *t*, and *y* is a satisfying assignment for the boolean formula *x*, then *y* appears in *iseqSAT(x)* at an index less than *c′(t+|x|)* for *c′* depending only on *p*.

This is remarkable! This means we have a concrete program that is optimal (up to a constant factor) for solving SAT. As a consequence, the question of P vs. NP boils down to a question about this single program’s running time. Define *time_p(x)* to be the number of steps program *p* takes to generate a nonzero value on input *x*. Now P=NP if and only if there is a polynomial *q* such that *time_iseqprogSAT(x) ≤ q(|x|)* for every satisfiable boolean formula *x*.

In other words, there may be 1 million US$ waiting for you if you’re able to analyze *iseqprogSAT*‘s running time in detail.

Now we’ll have a look at the other caveat about Levin’s idea: the constant factor. In the 1990s, under the supervision of Neil Jones and Stephen Cook, I worked on implementing a program enumerator that would get *iseqprog* to actually terminate on some toy problems. The problem, of course, is that the constant factors involved are so large you’ll be tempted to never use big-O notation ever again. Let’s assume that your programs are sequences of *k* different instructions, and that every sequence of instructions is a valid program. Then the index of a program *p* is roughly *k^|p|*. The constant factor *c* is then approximately *2^(k^|p|)*, i.e. doubly exponential in *|p|*. So to get an answer from *iseqprog* the useful programs need to be really short.

Actually I found that *iseqprog* favours short programs so much that it sometimes fails to find the program that actually computes the function you’re looking for. In one case, half of the inputs caused one little program, *p’*, to give the correct result while the other half of the inputs caused another little program, *p’’*, to give *iseqprog* its output. A program that tested the input and then continued as either *p’* or *p’’* was too long to ever get simulated.

It’s actually possible to reduce the constant factor *c* by a lot, if you’re willing to sacrifice the optimality in asymptotic running time. By revising the strategy used to pick which program to interpret, you will obtain different tradeoffs between the constant factor and the asymptotic relation. For instance, consider the variant of *iseq(x)*, call it *iseq_triangle(x)*, obtained by using the following simple strategy in Levin’s construction:

Round 1: +1

Round 2: 1+2

Round 3: 12+3

Round 4: 123+4

Round 5: 1234+5
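In the toy Python model from before, only the budget changes: every started program gets a single step per round (again a sketch with invented names):

```python
from itertools import count, islice

def iseq_triangle(x, progen):
    """The "triangle" schedule: in round r, each of the programs
    started so far gets exactly one interpreter step."""
    running = []
    for r in count(1):
        running.append(progen(r, x))   # "+r": start program r
        for g in running:
            yield next(g, 0)           # one step per program per round

# toy enumerator: "program i" outputs x + i after 3*i steps
def toy_progen(i, x):
    for _ in range(3 * i):
        yield 0
    yield x + i

seq = list(islice(iseq_triangle(5, toy_progen), 50))
assert seq[6] == 6   # program 1's 4th step comes after rounds of size 1+2+3
```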

I’ll postulate the following, leaving the proof to the reader: If program *p* outputs *y* on input *x* in time *t*, then *y* appears in *iseq_triangle(x)* at an index less than *(t+p)²*.

I once identified a few strategies of this kind but never got around to clarifying in more detail which tradeoffs are possible, or indeed optimal. Could the “triangle” strategy be improved so that the expression above instead would be ? I doubt it, but have no proof. It seems like a small but interesting math exercise.

In one variation of *iseqprog* the programs are actually enumerated in the order of their descriptive complexity. See the references below for details on that.

Claus-Peter Schnorr analyzed applications of Levin’s result in a 1976 ICALP paper. In particular, he was interested in defining a class of predicates that do not allow asymptotic speedup. The contrast to the above results should be noted: here it is an actual predicate, a 2-valued function, that does not have speedup.

I have not been able to prove Schnorr’s main result (the paper’s proof is missing a few details) but I’d like to outline his central idea because it is interesting, and maybe one of the readers can help by providing a proof in the comments of this blog post. I have simplified his definition a bit and refer you to the ICALP paper for the general definition, and for his results on graph isomorphism and speedup in the Turing machine space model.

Let us adapt some notation from the previous blog post by Amir Ben-Amram and define the *complexity set* of a function *f* to be

In the remainder of this section, all data will be bit strings, and *P* will designate a binary predicate, i.e. . You may think of SAT as a prime example of the kind of predicates Schnorr analyzes. The decision function for *P* will be defined by

A function *w* is a witness function for *P* if and only if

The idea behind Schnorr’s result is to consider a class of predicates *P* for which there is a tight connection between the complexity set and the complexity sets of the associated witness functions:

The class in question is the class of (polynomial-time) *self-reducible* predicates. The criteria for being self-reducible are a bit complex. I will provide a simplified, less general, version here. *P* is self-reducible if implies and there is a polynomial-time function mapping a pair of (bit string, bit) to a bit string such that

*Theorem* (Schnorr, 1976, Theorem 2.4, rewritten): When *P* is self-reducible, there is an integer *k* and witness function *w* for *P* such that

This theorem is not too hard to prove. To find a witness *y* for an *x*, you figure out the bits of *y* one at a time. It takes rounds in which we test both 0 and 1 as the potential “next bit” of a witness. For the details, I refer you to Schnorr’s paper.
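To make the bit-by-bit idea concrete, here is a toy search-to-decision sketch for SAT in Python. The brute-force `decide` merely stands in for whatever decision procedure is given (names and formula encoding are mine); the point is that `witness` calls it only a linear number of times:

```python
from itertools import product

def decide(cnf, partial, nvars):
    """Brute-force stand-in for a decision procedure: is there a
    satisfying assignment extending the fixed prefix `partial`?"""
    for rest in product([False, True], repeat=nvars - len(partial)):
        assign = list(partial) + list(rest)
        if all(any(assign[abs(l) - 1] == (l > 0) for l in c) for c in cnf):
            return True
    return False

def witness(cnf, nvars):
    """Self-reducibility in action: fix the witness bits one at a
    time, calling the decision procedure O(nvars) times."""
    prefix = []
    for _ in range(nvars):
        for bit in (False, True):
            if decide(cnf, prefix + [bit], nvars):
                prefix = prefix + [bit]
                break
        else:
            return None                  # unsatisfiable
    return prefix

# (x1 or not x2) and (not x1 or x2) and (x2): forces x1 = x2 = True
assert witness([[1, -2], [-1, 2], [2]], 2) == [True, True]
```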

The main theorem of interest to this blog post is Schnorr’s Theorem 2.7. A precise statement of the Theorem requires more technical detail than I’m able to provide here, but its essence is this: For a self-reducible predicate *P*, the decision problem cannot be sped up by a factor of .

As mentioned above, I’ve not been able to construct a proof based on the ICALP paper, so I’ll leave this as homework for the readers! It certainly seems like all of the necessary constructions have been lined up, but at the place where “standard methods of diagonalization” should be applied I cannot find a satisfactory interpretation of how to combine the big-O notation with the quantification of the variable *i*. I’d be very interested in hearing from readers who succeed in proving this Theorem.

All papers mentioned below appear in this blog’s bibliography.

Leonid Levin introduced the idea in (Levin, 1973). I must admit that I’ve never read the original Russian paper nor its translation in (Trakhtenbrot, 1984), so I rely on (Gurevich, 1988) and (Li and Vitányi, 1993) in the following. The paper presented his “Universal Search Theorem” as a result concerning resource-bounded descriptional complexity. There was no proof in the paper, but he provided the proof in private communications to Gurevich. Levin’s paper uses an advanced strategy for selecting which program to generate in each round. This strategy causes the constant factor associated with a program *p* to be where *K(p)* is the prefix complexity of *p* and for some constant *k*. This is explained in Section 7.5 of (Li and Vitányi, 1993).

Schnorr’s paper (Schnorr, 1976) is the earliest English exposition on this topic that I know of, and it seems to be the first application of Levin’s idea to predicates rather than functions with arbitrarily complex values. Gurevich dedicated most of (Gurevich, 1988) to explaining Levin’s idea which seems to have been largely unknown at the time. A major topic in Gurevich’s discussion is the complexity models in which Levin’s idea can be utilized. Amir Ben-Amram wrote a clear and precise exposition on Levin’s idea in Neil Jones’s complexity book (Ben-Amram, 1997), in his guest chapter “The existence of optimal programs”.

There have been some experiments with practical implementation of Levin’s idea. (Li and Vitányi, 1993) mentions work from the 1980’s that combines Levin’s algorithm with machine learning. My own experiments (Christensen, 1999) were done without knowledge of this prior work; they do not use machine learning but focus on tailored programming languages and efficient implementations.

I hold a Ph.D. in computer science and am based in Copenhagen, Denmark. I am currently a Senior System Engineer working on developing highly scalable, distributed systems for Issuu, the leading digital publishing platform (see http://issuu.com/about). My interest in complexity theory was nurtured by Neil D. Jones and I was brought up on his book “Computability and Complexity From a Programming Perspective”. I recently had the pleasure of co-authoring a paper with Amir Ben-Amram and Jakob G. Simonsen for the Chicago Journal of Theoretical Computer Science, see http://cjtcs.cs.uchicago.edu/articles/2012/7/contents.html

Then, there is a third result, which seems to be less widely known, though it definitely ought to be: I will call it **The Fundamental Theorem of Complexity Theory**, a name given by Meyer and Winklmann to a theorem they published in 1979, following related work including (Meyer and Fischer, 1972) and (Schnorr and Stumpf, 1975). As with the NP story, similar ideas were being invented by Levin in the USSR at about the same time. Several years later, with Levin relocated to Boston, these lines of research united and the theorem reached its ultimate, tightest, polished form, presented in a short paper by Levin and, thankfully, a longer “complete exposition” by Seiferas and Meyer (here – my thanks to Janos Simon for pointing this paper out to me). Seiferas and Meyer did not name it *The Fundamental Theorem*, perhaps to avoid ambiguity, but I think that it does deserve a name more punchy than “a characterization of realizable space complexities” (the title of the article).

My purpose in this blog post is to give a brief impression of this result and its significance to the study of speedup phenomena. The reader who becomes sufficiently interested can turn to the complete exposition mentioned (another reason to do so is the details I omit, for instance concerning partial functions).

Some definitions: An *algorithm* will mean, henceforth, a Turing machine with a read-only input tape and a worktape whose alphabet is (the machine can sense the end of the tape – so no blank necessary). *Program* also means such a machine, but emphasizes that its “code” is written out as a string. *Complexity* will mean space complexity as measured on the worktape. For a machine , its space complexity function is denoted by . A function that is the of some is known as *space-constructible*. Note that up to a certain translation overhead, results will apply to any Turing-equivalent model and a suitable complexity measure (technically, a Blum measure).

The Seiferas-Meyer paper makes use of a variety of clever programming techniques for space-bounded Turing machines; one example is universal simulation with a constant *additive* overhead.

I will argue that this theorem formulates an (almost) ultimate answer to the following (vague) question: “what can be said about optimal algorithms and speedup in general, that is, not regarding specific problems of interest?”

Examples of things that can be said are **Blum’s speedup theorem**: there exist problems with arbitrarily large speedup; the **Hierarchy theorem**: there do exist problems that have an optimal algorithm, at various levels of complexity (the type of result they prove is called a *compression theorem*, which unfortunately creates confusion with the well-known tape compression theorem. The appellation *hierarchy theorem* may bring the right kind of theorem to mind).

As shown by Seiferas and Meyer, these results, among others, can all be derived from the Fundamental Theorem, and for our chosen measure of space, the results are particularly tight: their Compression Theorem states that for any space-constructible function , there are algorithms of that complexity that cannot be sped up by more than an *additive* constant. So, even a tiny multiplicative speedup is ruled out for these algorithms – and the algorithm may be assumed to compute a predicate (so, the size of the output is one bit and is not responsible for the complexity).

An important step to this result is the choice of a clever way to express the notion of “a problem’s complexity” (more specifically, the complexity of computing a given function). To the readership of this blog it may be clear that such a description cannot be, as one may naïvely assume, a single function that describes the complexity of a single, best algorithm for the given problem. The good answer is a so-called *complexity set*. This is the set of all functions for machines that solve the given function (Meyer and Fischer introduced a similar concept, *complexity sequence*, which is specifically intended to describe problems with speedup).

How can a complexity set be specified? Since we are talking about a set of *computable* functions here (in fact, *space-constructible*), it can be given as a set of *programs *that compute the functions. This is called a *complexity specification*. The gist of the Theorem is that it tells us which sets of programs *do* represent the complexity of something – moreover, it offers a choice of equivalent characterizations (an always-useful type of result).

Clearly, if a can be computed in space , it can also be computed in a larger space bound; it can also be computed in a space bound smaller by a constant (a nice exercise in TM hacking – note that we have fixed the alphabet). If and are space bounds that suffice, then does too (another Turing-machine programming trick). So, we can assume that a set of programs represents the *closure* of the corresponding set of functions under the above rules. We can also allow it to include programs that compute functions which are not space-constructible: they will not be space bounds themselves, but will imply that constructible functions above them are. So, even a single program can represent an infinite family of space bounds: specifically, the bounds lower-bounded (up to a constant) by the given function.

**Theorem.** Both of the following are ways to specify all existing complexity sets:

- Sets of *programs* described by predicates (i.e., , where is a decidable predicate).
- Singletons.

The last item I find the most surprising. It is also very powerful. For any machine , the fact that is a complexity specification tells us that there is a function (in fact, a predicate) computable in space **if and only if** . Here is our Compression theorem!

The first characterization is important when descriptions by an infinite number of functions are considered. For example, let us prove the following:

**Theorem.** There is a decision problem, solvable in polynomial (in fact, quadratic) space, that has no best (up to constant factors) algorithm.

Proof. Let be a program that computes , written so that is just a hard-coded constant. The idea is for the set of to be recursively enumerable. Note that all these functions are constructible, that is, prospective space bounds.

Then, by the fundamental theorem, there is a decision problem solvable in space if and only if is at least one of these functions (up to an additive constant). If some algorithm for this problem takes at least space, then there is also a solution in , and so forth.

As exciting as I find this theorem, it has its limitations. Not all the speedup-related results seem to follow from it; for instance, the other Levin’s theorem doesn’t (or I couldn’t see how). Also, results like those given here and here about the measure, or topological category, of sets like the functions that have or do not have speedup, do not seem to follow. In fact, Seiferas and Meyer only prove a handful of “classic” results like Blum speedup, the Compression theorem and a Gap theorem. What new questions about complexity sets *can* be asked and answered using these techniques?

Another limitation is that for complexity measures other than space we do not have such tight results. So, for example, for Turing-machine *time* we are stuck with the not-so-tight hierarchy theorems proved by diagonalization, padding etc. (see references in the bibliography). Is this a problem with our proof methods? Or could some surprising speedup phenomenon be lurking there?

[Oct 16, 2012. Fixed error in last theorem]

**Bilinear algorithms and recursion.**

Strassen’s approach was to exploit the inherent recursive nature of matrix multiplication: the product of two matrices can be viewed as the product of two matrices, the entries of which are matrices. Suppose that we have an algorithm *ALG* that runs in time and multiplies two matrices. Then one can envision obtaining a fast recursive algorithm for multiplying matrices (for any integer ) as well: view the matrices as matrices the entries of which are matrices; then multiply the matrices using *ALG* and when *ALG* requires us to multiply two matrix entries, recurse.

This approach only works provided that the operations that *ALG* performs on the matrix entries make sense as matrix operations: e.g. entry multiplication, taking linear combinations of entries etc. One very general type of such algorithm is the so-called *bilinear* algorithm: Given two matrices and , compute products

i.e. take possibly different linear combinations of entries of and multiply each one with a possibly different linear combination of entries of . Then, compute each entry of the product as a linear combination of the : .

Given a bilinear algorithm *ALG* for multiplying two matrices (for constant ) that computes products , the recursive approach that multiplies matrices using *ALG* gives a bound . To see this, notice that the number of additions that one has to do is no more than : at most to compute the linear combinations for each and at most for each of the outputs . Since matrix addition takes linear time in the matrix size, we have a recurrence of the form .

As long as we get a nontrivial bound on . Strassen’s famous algorithm used *k = 2* with *t = 7* products, thus showing that *ω ≤ log₂ 7 < 2.81*. A lot of work went into getting better and better “base algorithms” for varying constants . Methods such as Pan’s method of trilinear aggregation were developed. This approach culminated in Pan’s algorithm (1978) for multiplying *70 × 70* matrices that used *143640* products and hence showed that *ω < 2.796*.
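To make the recursive bilinear approach concrete, here is the classic recursion with Strassen's seven products as the base algorithm (a standard textbook sketch, assuming for simplicity that the matrix dimension is a power of 2):

```python
import numpy as np

def strassen(A, B):
    """Multiply two n x n matrices recursively, n a power of 2."""
    n = A.shape[0]
    if n == 1:
        return A * B
    h = n // 2
    a11, a12, a21, a22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    b11, b12, b21, b22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # 7 products of linear combinations of blocks (the bilinear step)
    p1 = strassen(a11 + a22, b11 + b22)
    p2 = strassen(a21 + a22, b11)
    p3 = strassen(a11, b12 - b22)
    p4 = strassen(a22, b21 - b11)
    p5 = strassen(a11 + a12, b22)
    p6 = strassen(a21 - a11, b11 + b12)
    p7 = strassen(a12 - a22, b21 + b22)
    # each output block is a linear combination of the 7 products
    return np.block([[p1 + p4 - p5 + p7, p3 + p5],
                     [p2 + p4, p1 - p2 + p3 + p6]])

A = np.arange(16).reshape(4, 4)
B = np.arange(16, 32).reshape(4, 4)
assert np.array_equal(strassen(A, B), A @ B)
```

The recurrence behind it is T(n) = 7 T(n/2) + O(n²), giving O(n^(log₂ 7)) ≈ O(n^2.81) arithmetic operations.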

**Approximate algorithms and Schonhage’s theorem.**

A further step was to look at more general algorithms, so called *approximate* bilinear algorithms. In the definition of a bilinear algorithm the coefficients $\alpha_{ij\ell}, \beta_{ij\ell}, \gamma_{ij\ell}$ were constants. In an approximate algorithm, these coefficients can be formal linear combinations of integer powers of an indeterminate $\varepsilon$ (e.g. $3\varepsilon^{-2} + \varepsilon$). The entries of the product are then only “approximately” computed, in the sense that $AB[i,j] = \sum_\ell \gamma_{ij\ell}\, P_\ell + O(\varepsilon)$, where the $O(\varepsilon)$ term is a linear combination of *positive* powers of $\varepsilon$. The term “approximate” comes from the intuition that if you set $\varepsilon$ to be close to $0$, then the algorithm would get the product almost exactly.

Interestingly enough, Bini et al. (1980) showed that when dealing with the asymptotic complexity of matrix multiplication, approximate algorithms suffice for obtaining bounds on $\omega$. This is not obvious! What Bini et al. show, in a sense, is that as the size of the matrices grows, the “approximation” part can be replaced by a sort of bookkeeping which does not present an overhead asymptotically. The upshot is that if there is an *approximate* bilinear algorithm that computes $t$ products to compute the product of two $k \times k$ matrices, then $\omega \le \log_k t$.

Bini et al. (1979) gave the first approximate bilinear algorithm for a matrix product. Their algorithm used $10$ entry products to multiply a $3 \times 2$ matrix with a $2 \times 2$ matrix. Although this algorithm is for rectangular matrices, it can easily be converted into one for square matrices: a $12 \times 12$ matrix is a $3 \times 2$ matrix with entries that are $2 \times 2$ matrices with entries that are $2 \times 3$ matrices, and so multiplying $12 \times 12$ matrices can be done recursively using (rotated versions of) Bini et al.’s algorithm three times, taking $10^3 = 1000$ entry products. Hence $\omega \le \log_{12} 1000 < 2.78$.
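The exponents quoted in this and the previous section follow from one-line computations; a quick sanity check (the product counts are the published ones):

```python
import math

def omega_bound(k, t):
    """Exponent bound log_k(t) obtained from a (possibly approximate)
    bilinear algorithm multiplying k x k matrices with t entry products."""
    return math.log(t) / math.log(k)

print(round(omega_bound(2, 7), 3))        # Strassen: 2.807
print(round(omega_bound(70, 143640), 3))  # Pan (1978): 2.795
print(round(omega_bound(12, 1000), 3))    # Bini et al., cubed: 2.78
```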

Schonhage (1981) developed a sophisticated theory involving the bilinear complexity of rectangular matrix multiplication that showed that approximate bilinear algorithms are even more powerful. His paper culminated in something called the Schonhage $\tau$-theorem, or the asymptotic sum inequality. This theorem is one of the most useful tools in designing and analyzing matrix multiplication algorithms.

Schonhage’s $\tau$-theorem says roughly the following. Suppose we have several instances of matrix multiplication, each involving matrices of possibly different dimensions, and we are somehow able to design an approximate bilinear algorithm that solves all instances and uses fewer products than would be needed when computing each instance separately. Then this bilinear algorithm can be used to multiply (larger) square matrices and would imply a nontrivial bound on $\omega$.

What is interesting about Schonhage’s theorem is that it is believed that when it comes to *exact* bilinear algorithms, one cannot use fewer products to compute several instances than one would use by just computing each instance separately. This is known as Strassen’s additivity conjecture. Schonhage showed that the additivity conjecture is false for *approximate* bilinear algorithms. In particular, he showed that one can approximately compute the product of a $4 \times 1$ vector by a $1 \times 4$ vector and the product of a $1 \times 9$ vector by a $9 \times 1$ vector together using only $17$ entry products, whereas any exact bilinear algorithm would need at least $16 + 9 = 25$ products. His theorem then implied $\omega < 2.55$, and this was a huge improvement over the previous bound $\omega < 2.78$ of Bini et al.

**Using fast solutions for problems that are not matrix multiplications.**

The next realization was that there is no immediate reason why the “base algorithm” that we use for our recursion has to compute a matrix product at all. Let us focus on the following family of computational problems. We are given two vectors $x$ and $y$ and we want to compute a third vector $z$. The dependence of $z$ on $x$ and $y$ is given by a three-dimensional tensor $T$ as follows: $z_k = \sum_{i,j} T_{ijk}\, x_i y_j$. Each entry of $z$ is a bilinear form in $x$ and $y$. The tensor $T$ can be arbitrary, but let us focus on the case where $T_{ijk} \in \{0, 1\}$. Notice that $T$ completely determines the computational problem. Some examples of such bilinear problems are polynomial multiplication and of course matrix multiplication. For polynomial multiplication, $T_{ijk} = 1$ if and only if $i + j = k$, and for matrix multiplication (indexing $x$ and $y$ by pairs of matrix coordinates), $T_{(i,k),(k',j),(i',j')} = 1$ if and only if $k = k'$ and $(i, j) = (i', j')$.
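Concretely, a $0/1$ tensor can be stored as the set of its nonzero positions and the bilinear map evaluated directly; a small sketch (the names are illustrative, not from the post), instantiated for polynomial multiplication:

```python
def bilinear_eval(T, x, y, out_len):
    """Evaluate z_k = sum_{i,j} T_ijk * x_i * y_j, where the tensor T is
    given as the set of index triples (i, j, k) with T_ijk = 1."""
    z = [0] * out_len
    for (i, j, k) in T:
        z[k] += x[i] * y[j]
    return z

# Polynomial multiplication of two degree-2 polynomials:
# T_ijk = 1 iff i + j = k.
n = 3
T_poly = {(i, j, i + j) for i in range(n) for j in range(n)}
# (1 + 2u + 3u^2) * (4 + 5u + 6u^2)
print(bilinear_eval(T_poly, [1, 2, 3], [4, 5, 6], 2 * n - 1))
# [4, 13, 28, 27, 18]
```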

The nice thing about these bilinear problems is that one can easily extend the theory of bilinear algorithms to them. A bilinear algorithm computing a problem instance for tensor $T$ computes $t$ products of the form $P_\ell = (\sum_i \alpha_{i\ell} x_i) \cdot (\sum_j \beta_{j\ell} y_j)$ and then sets $z_k = \sum_\ell \gamma_{k\ell} P_\ell$. Here, an algorithm is nontrivial if the number $t$ of products that it computes is less than the number of positions where the tensor $T$ is nonzero.

In order to be able to talk about recursion for general bilinear problems, it is useful to define the tensor product of two tensors $T$ and $T'$: $(T \otimes T')_{(i,i'),(j,j'),(k,k')} = T_{ijk} \cdot T'_{i'j'k'}$. Thus, the bilinear problem defined by $T \otimes T'$ can be viewed as a bilinear problem defined by $T$, where each entry product is actually itself a bilinear problem defined by $T'$.

This allows one to compute an instance of the problem defined by $T \otimes T'$ using an algorithm for $T$ and an algorithm for $T'$. One can similarly define the $n$-th tensor power $T^{\otimes n}$ of a tensor $T$ as tensor-multiplying $T$ by itself $n$ times. Then any bilinear algorithm computing an instance defined by $T$ using $t$ entry products can be used recursively to compute the $n$-th tensor power of $T$ using $t^n$ products, just as in the case of matrix multiplication.
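The tensor product and the nonzero-count bookkeeping can be sketched in a few lines (a toy representation, with tensors as sets of nonzero index triples):

```python
def tensor_product(T1, T2):
    """(T1 (x) T2)_{(i,i'),(j,j'),(k,k')} = T1_ijk * T2_i'j'k';
    tensors are represented as sets of index triples where the entry is 1."""
    return {((i, i2), (j, j2), (k, k2))
            for (i, j, k) in T1 for (i2, j2, k2) in T2}

def matmul_tensor(a, b, c):
    """Tensor of the (a x b) by (b x c) matrix product: x is indexed by
    (row, inner), y by (inner, col), z by (row, col)."""
    return {((i, k), (k, j), (i, j))
            for i in range(a) for k in range(b) for j in range(c)}

# Tensor products of matrix multiplication tensors are (bigger) matrix
# multiplication tensors: the square of the 2x2 tensor has 8 * 8 = 64
# nonzeros, matching the 64 nonzeros of the 4x4 product tensor.
T = matmul_tensor(2, 2, 2)
assert len(tensor_product(T, T)) == len(matmul_tensor(4, 4, 4))
```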

A crucial development in the study of matrix multiplication algorithms was the discovery that sometimes algorithms for bilinear problems that do not look at all like matrix products can be converted into matrix multiplication algorithms. This was first shown by Strassen in the development of his “laser method” and was later exploited in the work of Coppersmith and Winograd (1987,1990). The basic idea of the approach is as follows.

Consider a bilinear problem $T$ for which you have a really nice approximate algorithm *ALG* that uses $t$ entry products. Take the $n$-th tensor power $T^{\otimes n}$ of $T$ (for large $n$), and use *ALG* recursively to compute $T^{\otimes n}$ using $t^n$ entry products. $T^{\otimes n}$ is a bilinear problem that computes a long vector $z$ from two long vectors $x$ and $y$. Suppose that we can embed the product of two $m \times m$ matrices $A$ and $B$ into $T^{\otimes n}$ as follows: we put each entry of $A$ into some position of $x$ and set all other positions of $x$ to $0$, we similarly put each entry of $B$ into some position of $y$ and set all other positions of $y$ to $0$, and finally we argue that each entry of the product $AB$ appears in some position of the computed vector $z$. Then we would have a bilinear algorithm for computing the product of two $m \times m$ matrices using $t^n$ entry products, and hence $\omega \le \log_m t^n$.

The goal is to make $m$ as large a function of $n$ as possible, thus minimizing the upper bound on $\omega$.

Strassen’s laser method and Coppersmith and Winograd’s paper, and even Schonhage’s $\tau$-theorem, present ways of embedding a matrix product into a large tensor power of a different bilinear problem. The approaches differ in the starting algorithm and in the final matrix product embedding. We’ll give a very brief overview of the Coppersmith-Winograd algorithm.

**The Coppersmith-Winograd algorithm.**

The bilinear problem that Coppersmith and Winograd start with is as follows. Let $q$ be an integer. Then we are given two vectors $x$ and $y$ of length $q+2$, with coordinates indexed $0, 1, \ldots, q+1$, and we want to compute a vector $z$ of length $q+2$ defined as follows:

$z_0 = \sum_{i=1}^{q} x_i y_i + x_0 y_{q+1} + x_{q+1} y_0$,

$z_i = x_0 y_i + x_i y_0$ for $i = 1, \ldots, q$, and $z_{q+1} = x_0 y_0$.

Notice that this is far from being a matrix product. However, it is related to matrix products:

$\sum_{i=1}^{q} x_i y_i$, which is the inner product of two $q$-length vectors,

$x_0 y_{q+1}$, $x_{q+1} y_0$, and $x_0 y_0$, which are three scalar products, and

the two matrix products computing $x_0 y_i$ and $x_i y_0$ for $i = 1, \ldots, q$, which are both products of a vector with a scalar.
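For reference, the Coppersmith-Winograd starting map can be evaluated directly; a sketch (the docstring restates the standard definition, with $x$, $y$, $z$ of length $q+2$):

```python
def cw_map(x, y):
    """The Coppersmith-Winograd starting bilinear map: x and y have length
    q + 2 (indices 0..q+1); returns z of the same length with
      z_0     = sum_{i=1..q} x_i y_i + x_0 y_{q+1} + x_{q+1} y_0,
      z_i     = x_0 y_i + x_i y_0          for i = 1..q,
      z_{q+1} = x_0 y_0."""
    q = len(x) - 2
    z = [0] * (q + 2)
    z[0] = (sum(x[i] * y[i] for i in range(1, q + 1))
            + x[0] * y[q + 1] + x[q + 1] * y[0])
    for i in range(1, q + 1):
        z[i] = x[0] * y[i] + x[i] * y[0]
    z[q + 1] = x[0] * y[0]
    return z

print(cw_map([1, 2, 3, 4], [5, 6, 7, 8]))  # q = 2 -> [61, 16, 22, 5]
```

One can see the merged pieces in the code: the inner product and two scalar products inside $z_0$, the vector-times-scalar products inside $z_1, \ldots, z_q$, and the scalar product $z_{q+1}$.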

If we could somehow convert Coppersmith and Winograd’s bilinear problem into one that computes these products as *independent* instances, then we would be able to use Schonhage’s $\tau$-theorem. Unfortunately, however, the matrix products are merged in a strange way, and it is unclear whether one can get anything meaningful out of an algorithm that solves this bilinear problem.

Coppersmith and Winograd develop a multitude of techniques to show that when one takes a large tensor power of the starting bilinear problem, one can actually decouple these merged matrix products, and one can indeed apply the $\tau$-theorem. The $\tau$-theorem then gives the final embedding of a large matrix product into a tensor power of the original construction, and hence defines a matrix multiplication algorithm.

Their approach combines several impressive ingredients: sets avoiding $3$-term arithmetic progressions, hashing, and the probabilistic method. The algorithm computing their base bilinear problem is also impressive. The number of entry products it computes is $q+2$, which is exactly the length of the output vector $z$! That is, their starting algorithm is optimal for the particular problem that they are solving.

What is not optimal, however, is the analysis of how good a matrix product algorithm one can obtain from the base algorithm. Coppersmith and Winograd noticed this themselves: They first applied their analysis to the original bilinear algorithm and obtained an embedding of an $m \times m$ matrix product into the $n$-th tensor power of the bilinear problem, for some explicit function $m = m(n)$. (Then they took $n$ to go to infinity and obtained $\omega < 2.388$.) Then they noticed that if one applies the analysis to the second tensor power of the original construction, then one obtains an embedding of an $m' \times m'$ matrix product into the same $n$-th tensor power, where $m' > m$. That is, although one is considering embeddings into the same ($n$-th) tensor power of the construction, the analysis crucially depends on which tensor power of the construction you start from! This led to the longstanding bound $\omega < 2.376$. Coppersmith and Winograd left as an open problem what bound one can get if one starts from the third or larger tensor powers.

**The recent improvements on $\omega$.**

It seems that many researchers attempted to apply the analysis to the third tensor power of the construction, but this somehow did not improve the bound on $\omega$. Because of this, and since each new analysis required a lot of work, the approach was abandoned, at least until 2010. In 2010, Andrew Stothers carried through the analysis on the fourth tensor power and discovered that the bound on $\omega$ can be improved to $2.3737$.

As mentioned earlier, the original Coppersmith-Winograd bilinear problem was related to different matrix multiplication problems that were merged together. The $n$-th tensor power of the bilinear problem is similarly composed of merged instances of simpler bilinear problems; however, these instances may no longer be matrix multiplications. When applying a Coppersmith-Winograd-like analysis to the $n$-th tensor power, there are two main steps.

The first step involves analyzing each of the simpler bilinear problems, intuitively in terms of how close they are to matrix products; there is a formal definition of this similarity measure called the *value* of the bilinear form. The second step defines a family of matrix product embeddings in the $n$-th tensor power in terms of the values. These embeddings are defined via some variables and constraints, and each represents some matrix multiplication algorithm. Finally, one solves a nonlinear optimization program to find the best among these embeddings, essentially finding the best matrix multiplication algorithm in the search space.

Both the Coppersmith-Winograd paper and Stothers’ thesis perform an entirely new analysis for each new tensor power. The main goal of my work was to provide a general framework so that the two steps of the analysis do not have to be redone for each new tensor power. My paper first shows that the first step, the analysis of each of the values, can be completely automated by solving linear programs and simple systems of linear equations. This means that instead of proving theorems one only needs to solve linear programs and linear systems, a simpler task. My paper then shows that the second step of the analysis, the theorem defining the search space of algorithms, can also be replaced by just solving a simple system of linear equations. (Amusingly, the fact that matrix multiplication algorithms can be used to solve linear equations implies that good matrix multiplication algorithms can be used to search for better matrix multiplication algorithms.) Together with the final nonlinear program, this presents a fully automated approach to performing a Coppersmith-Winograd-like analysis.

After seeing Stothers’ thesis in the summer of last year, I was impressed by a shortcut he had used in the analysis of the values of the fourth tensor power. This shortcut gave a way to use recursion in the analysis, and I was able to incorporate it in my analysis to show that the number of linear programs and linear systems one would need to solve to compute the values for the $n$-th tensor power drops substantially, at least when $n$ is a power of $2$. This drop in complexity allowed me to analyze the eighth tensor power, thus obtaining an improvement in the bound on $\omega$: $\omega < 2.3729$.

There are several lingering open questions. The most natural one is: how does the bound on $\omega$ change when applying the analysis to higher and higher tensor powers? I am currently working together with a Stanford undergraduate on this problem: we’ll apply the automated approach to several consecutive powers, hoping to uncover a pattern, so that one can then mathematically analyze what bounds on $\omega$ can be proven with this approach.

A second open question is more far-reaching: the Coppersmith-Winograd analysis is not optimal; in a sense it computes an approximation to the best embedding of a matrix product in a tensor power of their bilinear problem. What is the optimal embedding? Can one analyze it mathematically? Can one automate the search for it?

Finally, I am fascinated by automating the search for algorithms. In the special case of matrix multiplication we were able to define a search space of algorithms and then use software to optimize over this search space. What other computational problems can one approach in this way?


While thinking about your blog, two things occurred to me that might be worth mentioning:

1. If a function f(x) has speedup, then any lower bound on its computation can be improved by a corresponding amount. For example, if every program for computing f(x) can be sped up to run twice as fast (on all but a finite number of integers), then any lower bound G(x) on its run time can be raised from G(x) to 2G(x) (on all but a finite number of integers). For another example, if any program for computing f(x) can be sped up by a square root (so that any run time F(x) can be reduced to a run time of at most sqrt(F(x))), then any lower bound G(x) on its run time can be raised to [G(x)]^2, etc. This is all easy to see.

2. Much harder to see is a curious relation between speedup and inductive inference, which has to do with inferring an algorithm from observation of the sequence of integers that it generates. Theorem: there exists an inductive inference algorithm for inferring all sequences that have optimal algorithms (i.e. have programs that cannot be sped up)! This was quite a surprise (and a breakthrough) for me. Still is. To explain it though, I’d have to explain inductive inference, etc, and this would take me a bit of time. Some day…

Anyway, thanks again for your blog.

Best wishes and warm regards,

manuel (blum)

Despite the algebraic nature of the matrix multiplication problem, many of the suggested routes to proving $\omega = 2$ are combinatorial. This post is about connections between one such combinatorial conjecture (the “Strong Uniquely Solvable Puzzle” conjecture of Cohn, Kleinberg, Szegedy and Umans) and some more well-known combinatorial conjectures. These results appear in this recent paper with Noga Alon and Amir Shpilka.

We start with a well-known, and stubborn, combinatorial problem, the “cap-set problem.” A cap-set in $\mathbb{F}_3^n$ is a subset of vectors in $\mathbb{F}_3^n$ with the property that no three vectors (not all the same) sum to $0$. The best lower bound, $2.2174^n$, is due to Edel. Let us denote by the “cap-set conjecture” the assertion that one can actually find cap-sets of size $(3 - o(1))^n$. It appears that there is no strong consensus on whether this conjecture is true or false, but one thing that we do know is that it is a popular blog topic (see here, here, here, and the end of this post).
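The definition is easy to test exhaustively for tiny $n$; a brute-force sketch:

```python
from itertools import combinations_with_replacement, product

def is_cap_set(S):
    """Check that no three vectors of S (not all the same) sum to 0 mod 3.
    Since the condition is symmetric, unordered triples suffice."""
    for x, y, z in combinations_with_replacement(S, 3):
        if x == y == z:
            continue
        if all((a + b + c) % 3 == 0 for a, b, c in zip(x, y, z)):
            return False
    return True

# In F_3^2 the maximum cap-set size is 4; here is one such set...
S = [(0, 0), (0, 1), (1, 0), (1, 1)]
assert is_cap_set(S)
# ...and it is maximal: adding any fifth vector creates a violating triple.
assert all(not is_cap_set(S + [v])
           for v in product(range(3), repeat=2) if v not in S)
```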

Now it happens that the only triples of elements of $\mathbb{F}_3$ that sum to $0$ are $\{a, a, a\}$ and $\{0, 1, 2\}$: triples that are “all the same, or all different.” So the cap-set conjecture can be rephrased as the conjecture that there are large subsets of vectors in $\{0, 1, 2\}^n$ so that for any three vectors $x, y, z$ in the set (not all equal), there is some coordinate $i$ such that $x_i, y_i, z_i$ are “not all the same, and not all different.” Generalizing from this interpretation, we arrive at a family of much stronger assertions, one for each $D \ge 3$: let’s denote by the “$D$ conjecture” the assertion that there are subsets of vectors in $\{0, 1, \ldots, D-1\}^n$, of size $(D - o(1))^n$, with the property that for any three vectors $x, y, z$ in the set (not all equal), there is some coordinate $i$ such that $x_i, y_i, z_i$ are “not all the same, and not all different.” Such sets in $\{0, \ldots, D-1\}^n$ imply sets of this type of comparable size in $\{0, 1, 2\}^{n'}$ by viewing each symbol in $\{0, \ldots, D-1\}$ as a block of symbols in $\{0, 1, 2\}$, so the $D$ conjecture is stronger (i.e. it implies the cap-set conjecture). Indeed, if you play around with constructions you can see that as $D$ gets larger it seems harder and harder to have large sets avoiding triples of vectors for which every coordinate is “all the same, or all different.” Thus one would probably guess that the $D$ conjecture is false for large $D$.

One of the results in our paper is that the $D$ conjecture is in fact *equivalent* to falsifying the following well-known sunflower conjecture of Erdos-Szemeredi: there exists an $\varepsilon > 0$ such that any family of at least $2^{(1-\varepsilon)n}$ subsets of $\{1, \ldots, n\}$ contains a 3-sunflower (i.e., three sets whose pairwise intersections are all equal).

So the intuition that the $D$ conjecture is false agrees with Erdos and Szemeredi’s intuition, which is a good sign.

Now let’s return to the cap-set conjecture in $\mathbb{F}_3^n$. Being a cap set is a condition on *all* triples of vectors in the set. If one restricts to a condition on only some of the triples of vectors, then a construction becomes potentially easier. We will be interested in a “multicolored” version in which vectors in the set are assigned one of three colors (say red, blue, green), and we only require that no triple of vectors $(x, y, z)$ sums to $0$ with $x$ being red, $y$ being blue, and $z$ being green. But a moment’s thought reveals that the problem has become too easy: one can take (say) the red vectors to be all vectors beginning with $0$, the green vectors to be all vectors beginning with $0$, and the blue vectors to be all vectors beginning with $1$. A solution that seems to recover the original flavor of the problem is to insist that the vectors come from a collection of red, blue, green triples $(r_i, b_i, g_i)$ with $r_i + b_i + g_i = 0$; we then require that every red, blue, green triple *except those in the original collection* not sum to 0. So, let’s denote by the “multicolored cap-set conjecture” the assertion that there are subsets of *triples of vectors* $\{(r_i, b_i, g_i)\}$ from $\mathbb{F}_3^n$, of size $(3 - o(1))^n$, with each triple summing to $0$, and with the property that for any three triples $(r_i, b_i, g_i), (r_j, b_j, g_j), (r_k, b_k, g_k)$ in the set (not all equal), $r_i + b_j + g_k \ne 0$.

If $S$ is a cap set in $\mathbb{F}_3^n$, then the collection of triples $\{(x, x, x) : x \in S\}$ constitutes a multicolored cap-set of the same size, so the multicolored version of the conjecture is indeed weaker (i.e. it is implied by the cap-set conjecture).

The SUSP conjecture of Cohn, Kleinberg, Szegedy, and Umans is the following: there exist subsets of *triples of vectors* $\{(a_i, b_i, c_i)\}$ from $\{0, 1\}^n$, of size $(3/2^{2/3} - o(1))^n$, with each triple summing (in the integers) to the all-ones vector, and with the property that for any three triples $(a_i, b_i, c_i), (a_j, b_j, c_j), (a_k, b_k, c_k)$ in the set (not all equal), there is a coordinate $\ell$ such that $a_{i,\ell} + b_{j,\ell} + c_{k,\ell} = 2$.

The mysterious constant $3/2^{2/3} \approx 1.89$ comes from the fact that $\binom{n}{n/3} \approx (3/2^{2/3})^n$, and it is easy to see that one cannot have multiple triples with the same $a$ vector (or $b$, or $c$). More specifically, “most” triples that sum to the all-ones vector are balanced (meaning that $a$, $b$, and $c$ each have weight $n/3$), and there can be at most $\binom{n}{n/3}$ balanced triples without repeating an $a$ vector. So the conjecture is that there are actually subsets satisfying the requirements whose sizes approach this easy upper bound.
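The constant is easy to check numerically: $\binom{n}{n/3}^{1/n}$ approaches $3/2^{2/3} \approx 1.89$ (a quick sanity check; convergence is slow because of polynomial factors):

```python
import math

c = 3 / 2 ** (2 / 3)   # the SUSP constant, ~1.8899
for n in (30, 300, 3000):
    # n-th root of binomial(n, n/3), computed via logs to avoid overflow
    root = math.exp(math.log(math.comb(n, n // 3)) / n)
    print(round(root, 4))   # approaches c from below as n grows
print(round(c, 4))          # 1.8899
```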

If in the statement of the SUSP conjecture one replaces “there is a coordinate $\ell$ such that $a_{i,\ell} + b_{j,\ell} + c_{k,\ell} = 2$” with “there is a coordinate $\ell$ such that $a_{i,\ell} + b_{j,\ell} + c_{k,\ell} \ge 2$”, one gets the weaker “Uniquely Solvable Puzzle” conjecture instead of the “Strong Uniquely Solvable Puzzle” conjecture. Here one is supposed to interpret the triples as triples of “puzzle pieces” that fit together (i.e. their $1$’s exactly cover the $n$ coordinates without overlap); the main requirement then ensures that no other way of “assembling the puzzle” fits together in this way, thus it is “uniquely solvable.” It is a consequence of the famous Coppersmith-Winograd paper that the Uniquely Solvable Puzzle conjecture is indeed true. Cohn, Kleinberg, Szegedy and Umans showed that if the stronger SUSP conjecture is true, then $\omega = 2$.

Two of the main results in our paper are that (1) the $D$ conjecture implies the SUSP conjecture, and (2) the SUSP conjecture implies the multicolored cap-set conjecture.

So by (1), *disproving the SUSP conjecture* is as hard as proving the Erdos-Szemeredi conjecture (which recall is equivalent to disproving the $D$ conjecture). Of course we hope the SUSP conjecture is true, but if it is not, it appears that its falsity will be difficult to prove.

And by (2), proving the SUSP conjecture entails proving the multicolored cap-set conjecture. So apart from being a meaningful weakening of the famous cap-set conjecture in $\mathbb{F}_3^n$, the multicolored cap-set conjecture can be seen as a “warm-up” to (hopefully) proving the SUSP conjecture and establishing $\omega = 2$. As a start, we show in our paper a lower bound of $2.51^n$ for multicolored cap-sets, which beats the $2.2174^n$ lower bound of Edel for ordinary cap sets.

Returning to “speedup,” the topic of this blog, notice that the SUSP conjecture, as well as the cap-set, the multicolored cap-set, and the $D$ conjectures, all assert that there exist sets of size $(c - o(1))^n$ with certain properties, for various constants $c$. In all cases size $c^n$ itself is easily ruled out, and so all one can hope for is a sequence of sets, one for each $n$, whose sizes approach $c^n$ as $n$ grows. If the SUSP conjecture is true, then it is this sequence of sets that directly corresponds to a sequence of matrix multiplication algorithms with running times $O(N^{2+\varepsilon})$ (for multiplying $N \times N$ matrices) with $\varepsilon$ approaching $0$. This would then be a concrete manifestation of the “speedup” phenomenon for matrix multiplication.
