Levin’s and Schnorr’s optimality results

March 22, 2013 by

This week on “Speedup in Computational Complexity” we’re going to learn how to write an optimal solver for SAT. Thanks to Leonid Levin we know that a very partial answer to “Are there familiar computational problems with no best algorithm?” is “Well, computing a satisfying assignment for a boolean formula does have a best algorithm, that’s for sure!”. In fact we’ll explore several variations of an idea by Levin all of which provide us with computational problems that each have a best algorithm. One particular variation, by Claus-Peter Schnorr, applies to the class of self-reducible problems.

A word of warning before we start: The below constructions only work in some models of complexity. One of the models in which the results will not apply is the Turing Machine model. I will mention the model requirements as we go along but if you’d like a more detailed discussion of this topic, I refer you to Gurevich’s paper in the references.

Without further ado let’s jump straight into Levin’s neat little trick, performed by a combination of an interpreter and a program enumerator.

A wonderful sequence of values

The program that we’ll design in this section takes an input x and runs infinitely, outputting an infinite sequence of values. Our program will output a new number in the sequence every k steps for some constant k. The sequence produced will turn out to be quite a wonderful characterization of x (if you love computational complexity). I’ll use the name iseq(x) for the infinite sequence generated on input x.

To design our program – let’s call it iseqprog – we’ll need two other programs to start from: A program enumerator and an interpreter for the enumerated programs.

The program enumerator, progen, takes as input a pair (i,x) and returns the initial configuration of the program with index i on input x. We’ll expect this operation to be constant-time when either i=0 or we already called progen on (i-1,x). In other words: progen is more like a method of an object (with internal state) which expects to be called with inputs (0,x), (1,x), (2,x),… and is able to process each item in such a call sequence in constant-time.

The interpreter we’ll need cannot be any old interpreter. In these modern times we can expect a certain service level. The interpreter should work like a slot machine in the arcades: Whenever I put in a new coin I continue my game with three more lives. In other words, when I give the interpreter the configuration of program p after t steps on input x, it returns an updated configuration representing the state of program p after t+1 steps on input x. It also tells me if p terminated in step t+1 and, if so, the return value of p on x. All of this happens in constant time. After all, the interpreter only needs to simulate one single step of p on x.

Comment: Almost any old interpreter can be used for Levin’s construction, but the exposition would become more complex.

Now I’ll describe the computation of iseqprog on input x. The computation proceeds in rounds, and each round consists of a number of stages. There is an infinite number of rounds. The number of stages in each round is finite but not constant across rounds.

Round 1 has only 1 stage. In this first stage of the first round, iseqprog runs progen on (0,x) and gets back the initial configuration of program 0 on input x. iseqprog then uses the interpreter to interpret just 1 step of program 0 on input x. If program 0 happens to terminate on input x in that first step, iseqprog immediately outputs program 0’s output on input x. Regardless of whether the interpretation of program 0 terminated, iseqprog itself does not terminate; it is on its way to generate an infinite sequence. If the interpretation of program 0 on input x did not terminate in its first step, iseqprog outputs the value 0 before continuing, providing us with a signal that it’s still live and running. This concludes the first (1-stage) round of iseqprog’s computation on x.

The computation continues in an infinite sequence of rounds. In each round, iseqprog calls progen once, adding one new item to the set of program configurations it has accumulated during the previous rounds. Each of these program configurations is interpreted for a number of steps. Every time the interpreter has executed one step of a program i, iseqprog outputs one value. The output value will be 0 or program i’s output on x. Whatever program you may think of, iseqprog will eventually generate its output on x (in between a lot of 0’s and many other values).

If we use this shorthand:

  • “+i” means “create the initial configuration for program i on input x, then interpret 1 step on that configuration and output one value”
  • “i” means “interpret 1 more step on the stored configuration for program i and output one value”

then we can sum up iseqprog’s first rounds like this:

Round 1: +1
Round 2: 11+2
Round 3: 111122+3
Round 4: 11111111222233+4
Round 5: 111111111111111122222222333344+5
Round 6: 32 1’s, 16 2’s, 8 3’s, 4 4’s, 2 5’s, and one +6

I hope it has become clear why iseqprog should be able to generate a new item of iseq every k steps or less for some constant k. Apart from the administrative work of looking up and saving configurations in some table, each step involves at most one call to the program enumerator and one call to the interpreter. These calls were assumed to be constant-time. The administrative work I will simply assume to be constant-time as well. iseqprog cannot work as intended in all complexity models; in particular, it doesn’t work for Turing machines.
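To make the bookkeeping concrete, here is a minimal Python sketch of the construction. The interfaces progen(i,x) and step(conf) follow the informal description above, but their exact signatures, the 1-based program indexing (matching the shorthand rounds rather than the 0-based prose), and the choice to output 0 for steps allocated to a program that has already halted are my own; treat it as an illustration, not as Levin’s actual construction.

    import itertools

    def iseqprog(x, progen, step):
        """Sketch of Levin's doubling schedule.

        progen(i, x) -> initial configuration of program i on input x
        step(conf)   -> (new_conf, halted, value): interpret one more step

        Yields the items of iseq(x): program i's output on the step where it
        halts, and 0 for every other interpreted step.
        """
        configs = {}
        for r in itertools.count(1):                 # round r
            configs[r] = progen(r, x)                # the "+r" of the shorthand
            for i in range(1, r + 1):
                for _ in range(2 ** (r - i)):        # program i gets 2^(r-i) steps
                    if configs[i] is None:           # i already halted earlier
                        yield 0
                        continue
                    configs[i], halted, value = step(configs[i])
                    if halted:
                        configs[i] = None
                        yield value
                    else:
                        yield 0

Per item this does one interpreter step (or nothing, for a halted program) plus a constant amount of dictionary bookkeeping, and at most one progen call per round, which is the constant-work-per-item claim above, granting the model assumptions.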

Now let’s have a look at the sequence iseq(x) itself. The key observation is that although any individual program does not get much attention from iseqprog, it does get a specific percentage of attention that is not dependent on the input x. For instance, program 3 accounts for \frac{1}{2^3}=\frac{1}{8} of the interpreter calls made by iseqprog regardless of the input x. The percentage is tied only to the program’s index number according to the program enumerator. From this observation we can derive (proof left to the reader) the salient feature of iseq(x):

If program p outputs y on input x in time t, then y appears in iseq(x) at an index less than ct for c depending only on p.

I think this is great! Whatever you want to compute from x, you’ll find it in iseq(x). What’s more: Your answer appears quite early in the sequence – so early, in fact, that you might as well just run through iseq(x) rather than perform the computation itself! That’s why I decided to call iseq(x) a wonderful sequence.
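Since the proof is left to the reader, here is one rough way to estimate the constant for the doubling schedule above (a sketch, with no attempt to optimize the constant). Write i for the index of p under progen. Then

    \textrm{items output in round } r:\ \sum_{j=1}^{r} 2^{r-j} < 2^{r}, \qquad \textrm{items output in rounds } 1,\dots,R:\ < 2^{R+1},

    \textrm{interpreter steps given to program } i \textrm{ by the end of round } R \ge i:\ \sum_{r=i}^{R} 2^{r-i} = 2^{R-i+1}-1.

So if p halts on x within t steps, it has done so by the end of the first round R with 2^{R-i+1}-1 \ge t; for that R we have 2^{R} < 2^{i}(t+1), hence y appears at an index below 2^{R+1} < 2^{i+1}(t+1), i.e. one may take c on the order of 2^{i+1}.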

It would be too good to be true…were it not for two caveats. First, how do you recognize the value that you’re looking for? And second, what about that constant c? We’ll address these two questions below.

Comment: Another caveat is that the above doesn’t apply to all complexity models, in particular to Turing Machines. For most of the common complexity models, I expect that the result will be true if you replace ct by poly(t) where poly is a polynomial depending only on p.

I’ll end this section with a simple specialization of the above that is too nice not to mention:

For any function f in P, there is a polynomial p such that \min \{ i | {\it iseq(x)}_i = f(x) \} < p(|x|)

And yes, iseqprog generates a new item of iseq(x) in k steps or less for some constant k!

The optimal solver for SAT

So what good is all this if you cannot recognize the value you’re looking for? Luckily there are some situations where validating a correct answer is simpler than producing it – yes, I’m thinking about SAT. A satisfying assignment for a boolean formula can be validated in linear time. How can we exploit Levin’s idea to create an optimal solver for SAT?

The simplest answer is to modify the program enumerator. Our new program enumerator, call it progenSAT, wraps each program generated by the original program enumerator in a SAT validator. The computation of the wrapped program set up by progenSAT(i,x) proceeds in two phases like this:

Phase 1: Run program i on input x (starting from the configuration progen(i,x)) and assign its output value to the variable y.
Phase 2: If y is a satisfying assignment for the boolean formula x then output y else loop infinitely.

If we plug progenSAT into iseqprog we get a new program iseqprogSAT generating a new sequence iseqSAT(x) on input x.
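Here is a hedged Python sketch of the wrapping step, continuing the interfaces from the earlier sketch. The encoding of the formula x as a list of clauses of integer literals (DIMACS style) and of a candidate assignment as the set of variables set to true is my own choice; the post does not fix a representation. Note also that the Phase 2 check costs time linear in |x| rather than a single step, which is where the extra |x| term in the bound below comes from.

    def satisfies(cnf, assignment):
        """Linear-time validation: cnf is a list of clauses, each a list of
        nonzero integer literals; assignment is the set of true variables."""
        return all(any((lit > 0) == (abs(lit) in assignment) for lit in clause)
                   for clause in cnf)

    def make_progen_sat(progen, step):
        """Wrap every enumerated program in the Phase 1 / Phase 2 scheme above.
        (The configuration format ('run'/'loop', formula, inner_conf) is mine.)"""

        def progen_sat(i, x):
            return ('run', x, progen(i, x))          # Phase 1: run program i on x

        def step_sat(conf):
            tag, cnf, inner = conf
            if tag == 'loop':
                return (conf, False, None)           # Phase 2 failed: loop forever
            inner, halted, value = step(inner)
            if not halted:
                return (('run', cnf, inner), False, None)
            if isinstance(value, (set, frozenset)) and satisfies(cnf, value):
                return (('run', cnf, inner), True, value)   # Phase 2: output y
            return (('loop', cnf, None), False, None)

        return progen_sat, step_sat

In terms of the earlier sketch, plugging it in is then just iseqprogSAT = iseqprog(x, *make_progen_sat(progen, step)).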

Like the original iseqprog, our new program iseqprogSAT generates a new item every k steps or less for some constant k. I’m assuming that progenSAT also takes constant time to generate each new program configuration. Let us adapt the key observation about iseq(x) to the sequence iseqSAT(x) (once again, I’ll leave the proof to the reader):

If program p outputs y on input x in time t, and y is a satisfying assignment for the boolean formula x, then y appears in iseqSAT(x) at an index less than c’(t+|x|) for c’ depending only on p.

This is remarkable! This means we have a concrete program that is optimal (up to a constant factor) for solving SAT. As a consequence, the question of P vs. NP boils down to a question about this single program’s running time. Define \it time^{\neg 0}_p(x) to be the number of steps program p takes to generate a nonzero value on input x. Now P=NP if and only if there is a polynomial p such that \it time^{\neg 0}_{\it iseqprogSAT}(x) < p(|x|) for every satisfiable boolean formula x.

In other words, there may be 1 million US$ waiting for you if you’re able to analyze iseqprogSAT’s running time in detail.

Notes for the experimentalists

Now we’ll have a look at the other caveat about Levin’s idea: The constant factor. In the 1990’s, under the supervision of Neil Jones and Stephen Cook, I worked on implementing a program enumerator that would get iseqprog to actually terminate on some toy problems. The problem, of course, is that the constant factors involved are so large you’ll be tempted to never use big-O-notation ever again. Let’s assume that your programs are sequences of k different instructions, and that every sequence of instructions is a valid program. Then the index of a program p is roughly k^{|p|}. The constant factor c is then approximately 2^{k^{|p|}} i.e. doubly exponential in |p|. So to get an answer from iseqprog the useful programs need to be really short.
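Just to get a feeling for the numbers, with made-up parameters (a toy language of k = 16 instructions and a naive enumerator):

    import math

    k = 16                                   # instructions in the toy language (made up)
    for length in (2, 3, 4):                 # program length |p|
        index = k ** length                  # rough index of p under a naive enumerator
        digits = int(index * math.log10(2)) + 1
        print(f"|p| = {length}: index ~ {index}, c ~ 2**{index} (a {digits}-digit number)")

Already a three-instruction program comes with a constant factor of more than a thousand decimal digits, which is why only very short programs ever get a useful share of the interpreter.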

Actually I found that iseqprog favours short programs so much that it sometimes fails to find a program that actually computes the function you’re looking for. In one case, half of the inputs caused one little program, p’, to give the correct result, while for the other half it was another little program, p’’, that supplied iseqprog’s output. A program that tested the input and then continued as either p’ or p’’ was too long to ever get simulated.

It’s actually possible to reduce the constant factor c by a lot, if you’re willing to sacrifice the optimality in asymptotic running time. By revising the strategy used to pick which program to interpret, you will obtain different tradeoffs between the constant factor and the asymptotic behaviour. For instance, consider the variant of iseq(x), call it iseq_triangle(x), obtained by using the following simple strategy in Levin’s construction:

Round 1: +1
Round 2: 1+2
Round 3: 12+3
Round 4: 123+4
Round 5: 1234+5

I’ll postulate the following, leaving the proof to the reader: If program p outputs y on input x in time t, then y appears in iseq_triangle(x) at an index less than {\it index}_p^2 t^2.
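In terms of the earlier iseqprog sketch, only the step budget per round changes: every program created so far gets exactly one interpreted step. A hedged sketch:

    import itertools

    def iseqprog_triangle(x, progen, step):
        """Variant of the earlier sketch using the 'triangle' schedule:
        in round r, each of the programs 1..r is interpreted for one step."""
        configs = {}
        for r in itertools.count(1):
            configs[r] = progen(r, x)                # the "+r" of the shorthand
            for i in range(1, r + 1):
                if configs[i] is None:               # program i halted earlier
                    yield 0
                    continue
                configs[i], halted, value = step(configs[i])
                if halted:
                    configs[i] = None
                    yield value
                else:
                    yield 0

With this schedule, program i receives its t-th interpreted step in round i+t-1, and only about (i+t)^2/2 items have been produced by then, which is consistent with the {\it index}_p^2 t^2 bound postulated above.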

I once identified a few strategies of this kind but never got around to clarifying in more detail which tradeoffs are possible; or indeed optimal. Could the “triangle” strategy be improved so that the expression above instead would be {\it index}_p^2 t \log (t)? I doubt it, but have no proof. It seems like a small but interesting math exercise.

In one variation of iseqprog the programs are actually enumerated in the order of their descriptive complexity. See the references below for details on that.

Schnorr’s self-reducible predicates

Claus-Peter Schnorr analyzed applications of Levin’s result in a 1976 ICALP paper. In particular, he was interested in defining a class of predicates that do not allow asymptotic speedup. The contrast to the above results should be noted: it is an actual predicate, a 2-valued function, that does not have speedup.

I have not been able to prove Schnorr’s main result (the paper’s proof is missing a few details) but I’d like to outline his central idea because it is interesting, and maybe one of the readers can be helpful by providing a proof in the comments of this blog post. I have simplified his definition a bit and refer you to the ICALP paper for the general definition, and for his results on graph isomorphism and speedup in the Turing machine space model.

Let us adapt some notation from the previous blog post by Amir Ben-Amram and define the complexity set T_f of a function f to be

T_f \stackrel{\it def}= \{ {\it time}_p(x)\ |\ p\ {\it computes}\ f \}

In the remainder of this section, all data will be bit strings, and P will designate a binary predicate, i.e. P(x,y) \in \{ 0, 1\}. You may think of SAT as a prime example of the kind of predicates Schnorr analyzes. The decision function for P will be defined by

{\it decide}_P \stackrel{\it def}= \lambda x.\ \exists y: P(x,y)

A function w is a witness function for P if and only if

\forall x: P(x, w(x))\ \vee (\forall y: \neg P(x,y))

The idea behind Schnorr’s result is to consider a class of predicates P for which there is a tight connection between the complexity set T_{{\it decide}_P} and the complexity sets of the associated witness functions:

\bigcup \{ T_w\ |\ w\ \textrm{is a witness function for}\ P\}

The class in question is the class of (polynomial-time) self-reducible predicates. The criteria for being self-reducible are a bit complex. I will provide a simplified, less general, version here. P is self-reducible if P(x,y) implies |x| = |y| and there is a polynomial-time function {\it spec}_P mapping a pair of (bit string, bit) to a bit string such that

  1. P(x, y_1 \cdots y_n) = P({\it spec}_P(x, y_1), y_2 \cdots y_n)
  2. |{\it spec}_P(x, b)| < |x|

Theorem (Schnorr, 1976, Theorem 2.4, rewritten): When P is self-reducible, there is an integer k and a witness function w for P such that

t \in T_{{\it decide}_P} \Rightarrow \exists t' \in T_w: t' = \mathcal{O}(\lambda x .\ |x|^k t(x))

This theorem is not too hard to prove. To find a witness y for an x, you figure out the bits of y one at a time. It takes |x| rounds in which we test both 0 and 1 as the potential “next bit” of a witness. For the details, I refer you to Schnorr’s paper.
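For the flavour of it, here is a hedged Python sketch of that search, taking the simplified definition above at face value and ignoring the exact length bookkeeping. The black boxes decide_P and spec_P are named as in the post; everything else is my own illustration.

    def find_witness(x, decide_P, spec_P):
        """Bit-by-bit witness search behind Schnorr's Theorem 2.4 (a sketch)."""
        if not decide_P(x):
            return None                       # no y with P(x, y) exists at all
        witness = []
        for _ in range(len(x)):               # at most |x| rounds
            for b in (0, 1):                  # test both candidate "next bits"
                x_b = spec_P(x, b)
                if decide_P(x_b):             # the rest of a witness exists for x_b
                    witness.append(b)
                    x = x_b
                    break
            else:
                break                         # cannot happen for a genuine self-reduction
        return witness

Each round costs at most two calls to {\it decide}_P on smaller instances plus two polynomial-time calls to {\it spec}_P, which is where the |x|^k factor in the theorem comes from.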

The main theorem of interest to this blog post is Schnorr’s Theorem 2.7. A precise statement of the Theorem requires more technical detail than I’m able to provide here, but its essence is this: For a self-reducible predicate P, the decision problem {\it decide}_P cannot be sped up by a factor of |x|.

As mentioned above, I’ve not been able to construct a proof based on the ICALP paper, so I’ll leave this as homework for the readers! It certainly seems like all of the necessary constructions have been lined up, but at the place where “standard methods of diagonalization” should be applied I cannot find a satisfactory interpretation of how to combine the big-O notation with the quantification of the variable i. I’d be very interested in hearing from readers who succeed in proving this Theorem.

Historical notes and references

All papers mentioned below appear in this blog’s bibliography.

Leonid Levin introduced the idea in (Levin, 1973). I must admit that I’ve never read the original Russian paper nor its translation in (Trakhtenbrot, 1984), so I rely on (Gurevich, 1988) and (Li and Vitányi, 1993) in the following. The paper presented his “Universal Search Theorem” as a result concerning resource-bounded descriptional complexity. There was no proof in the paper, but he provided the proof in private communications to Gurevich. Levin’s paper uses an advanced strategy for selecting which program to generate in each round. This strategy causes the constant factor associated with a program p to be 2^{K(p)+1} where K(p) is the prefix complexity of p and K(p) \leq |p| + 2 \log |p| + k for some constant k. This is explained in Section 7.5 of (Li and Vitányi, 1993).

Schnorr’s paper (Schnorr, 1976) is the earliest English exposition on this topic that I know of, and it seems to be the first application of Levin’s idea to predicates rather than functions with arbitrarily complex values. Gurevich dedicated most of (Gurevich, 1988) to explaining Levin’s idea which seems to have been largely unknown at the time. A major topic in Gurevich’s discussion is the complexity models in which Levin’s idea can be utilized. Amir Ben-Amram wrote a clear and precise exposition on Levin’s idea in Neil Jones’s complexity book (Ben-Amram, 1997), in his guest chapter “The existence of optimal programs”.

There have been some experiments with practical implementation of Levin’s idea. (Li and Vitányi, 1993) mentions work from the 1980’s that combines Levin’s algorithm with machine learning. My own experiments (Christensen, 1999) were done without knowledge of this prior work and do not use machine learning but focus on tailored programming languages and efficient implementations.

About the author

I hold a Ph.D. in computer science and am based in Copenhagen, Denmark. I am currently a Senior System Engineer working on developing highly scalable, distributed systems for Issuu, the leading digital publishing platform (see http://issuu.com/about). My interest in complexity theory was nurtured by Neil D. Jones and I was brought up on his book “Computability and Complexity From a Programming Perspective”. I recently had the pleasure of co-authoring a paper with Amir Ben-Amram and Jakob G. Simonsen for the Chicago Journal of Theoretical Computer Science, see http://cjtcs.cs.uchicago.edu/articles/2012/7/contents.html

A Fundamental Theorem on Optimality and Speedup (guest post by Amir Ben-Amram)

October 11, 2012 by

Leonid Levin is known for several fundamental contributions to Complexity Theory. The most widely known is surely the notion of “universal search problem,” a concept similar to (and developed concurrently with) NP-completeness. Next, we might cite the proof of the existence of optimal “honest” algorithms for search problems (an honest algorithm uses a given verifier to check its result; expositions of this theorem can be found, among others, here and here). This is an important result regarding the tension between speedup and the existence of optimal algorithms, and will surely be discussed here in the future.

Then, there is a third result, which seems to be less widely known, though it definitely ought to be: I will call it The Fundamental Theorem of Complexity Theory, a name given by Meyer and Winklmann to a theorem they published in 1979, following related work including (Meyer and Fischer, 1972) and (Schnorr and Stumpf, 1975). As with the NP story, similar ideas were being invented by Levin in the USSR at about the same time. Several years later, with Levin relocated to Boston, these lines of research united and the ultimate, tightest, polished form of the theorem emerged, presented in a short paper by Levin and, thankfully, a longer “complete exposition” by Seiferas and Meyer (here – my thanks to Janos Simon for pointing this paper out to me). Seiferas and Meyer did not name it The Fundamental Theorem, perhaps to avoid ambiguity, but I think that it does deserve a name more punchy than “a characterization of realizable space complexities” (the title of the article).

My purpose in this blog post is to give a brief impression of this result and its significance to the study of speedup phenomena.  The reader who becomes sufficiently interested can turn to the complete exposition mentioned (another reason to do so is the details I omit, for instance concerning partial functions).

Why this theorem is really fundamental

Some Definitions: An algorithm will mean, henceforth, a Turing machine with a read-only input tape and a worktape whose alphabet is \{0,1\} (the machine can sense the end of the tape – so no blank necessary).  Program also means such a machine, but emphasizes that its “code” is written out as a string. Complexity will mean space complexity as measured on the worktape. For a machine e, its space complexity function is denoted by S_e. A function that is the S_e of some e is known as space-constructible.

Note that up to a certain translation overhead, results will apply to any Turing-equivalent model and a suitable complexity measure (technically, a Blum measure).

The Seiferas-Meyer paper makes use of a variety of clever programming techniques for space-bounded Turing machines; one example is universal simulation with a constant additive overhead.

I will argue that this theorem formulates an (almost) ultimate answer to the following (vague) question: “what can be said about optimal algorithms and speedup in general, that is, not regarding specific problems of interest?”

Examples of things that can be said are Blum’s speedup theorem (there exist problems with arbitrarily large speedup) and the hierarchy theorem (there do exist problems that have an optimal algorithm, at various levels of complexity). The type of result they prove is called a compression theorem, which unfortunately creates confusion with the well-known tape compression theorem; the appellation hierarchy theorem may bring the right kind of theorem to mind.

As shown by Seiferas and Meyer, these results, among others, can all be derived from the Fundamental Theorem, and for our chosen measure of space, the results are particularly tight: their Compression Theorem states that for any space-constructible function S, there are algorithms of that complexity that cannot be sped up by more than an additive constant. So, even a tiny (1-\varepsilon) multiplicative speedup is ruled out for these algorithms – and the algorithm may be assumed to compute a predicate (so, the size of the output is one bit and is not responsible for the complexity).

An important step to this result is the choice of a clever way to express the notion of “a problem’s complexity” (more specifically, the complexity of computing a given function). To the readership of this blog it may be clear that such a description cannot be, as one may naïvely assume, a single function that describes the complexity of a single, best algorithm for the given problem. The good answer is a so-called complexity set. This is the set of all functions S_e for machines e that solve the given function (Meyer and Fischer introduced a similar concept, complexity sequence, which is specifically intended to describe problems with speedup).

How can a complexity set be specified? Since we are talking about a set of computable functions here (in fact, space-constructible), it can be given as a set \cal E of programs that compute the functions. This is called a complexity specification. The gist of the Theorem is that it tells us which sets of programs do represent the complexity of something – moreover, it offers a choice of equivalent characterizations (an always-useful type of result).

Clearly, if a function f can be computed in space S, it can also be computed in a larger space bound; it can also be computed in a space bound smaller by a constant (a nice exercise in TM hacking – note that we have fixed the alphabet). If S_1 and S_2 are space bounds that suffice, then \min(S_1,S_2) does too (another Turing-machine programming trick). So, we can assume that a set of programs \cal E represents the closure of the corresponding set of functions under the above rules. We can also allow it to include programs that compute functions which are not space-constructible: they will not be space bounds themselves, but will imply that constructible functions above them are. So, even a single program can represent an infinite family of space bounds: specifically, the bounds S_e lower-bounded (up to a constant) by the given function.

A statement of the theorem (roughly) and consequences

Theorem. Both of the following are ways to specify all existing complexity sets:

  • Sets \cal E of programs described by \Sigma_2 predicates (i.e., {\cal E} = \{ e \mid \exists a \forall b \, P(a,b,e)\}, where P is a decidable predicate).
  • Singletons.

The last item I find the most surprising. It is also very powerful. For any machine e, the fact that \{e\} is a complexity specification tells us that there is a function (in fact, a predicate) computable in space S if and only if S \ge S_e - O(1). Here is our Compression theorem!

The first characterization is important when descriptions by an infinite number of functions are considered. For example, let us prove the following:

Theorem. There is a decision problem, solvable in polynomial (in fact, quadratic) space, that has no best (up to constant factors) algorithm.

Proof. Let e_k be a program that computes \lceil n^{1+\frac{1}{k}} \rceil, written so that k is just a hard-coded constant. The idea is for the set of e_k to be recursively enumerable. Note that all these functions are constructible, that is, prospective space bounds.

Then, by the fundamental theorem, there is a decision problem solvable in space S if and only if S is at least one of these functions (up to an additive constant). If some algorithm for this problem takes at least \lceil n^{1+\frac{1}{k}} \rceil space, then there is also a solution in space \lceil n^{1+\frac{1}{k+1}} \rceil, and so forth; since n^{1+\frac{1}{k+1}} grows more slowly than any constant multiple of n^{1+\frac{1}{k}}, no single algorithm can be best up to constant factors.

Limitations and some loose ends

As exciting as I find this theorem, it has its limitations. Not all the speedup-related results seem to follow from it; for instance, Levin’s other theorem doesn’t (or I couldn’t see how). Also, results like those given here and here about the measure, or topological category, of sets like the functions that have or do not have speedup, do not seem to follow. In fact, Seiferas and Meyer only prove a handful of “classic” results like Blum speedup, the Compression theorem and a Gap theorem. What new questions about complexity sets can be asked and answered using these techniques?

Another limitation is that for complexity measures other than space we do not have such tight results. So, for example, for Turing-machine time we are stuck with the not-so-tight hierarchy theorems proved by diagonalization, padding etc. (see references in the bibliography). Is this a problem with our proof methods? Or could some surprising speedup phenomenon be lurking there?

[Oct 16, 2012. Fixed error in last theorem]

On the recent progress on matrix multiplication algorithms (guest post by Virginia Vassilevska Williams)

September 22, 2012 by

A central question in the theory of algorithms is to determine the constant \omega, called the exponent of matrix multiplication. This constant is defined as the infimum of all real numbers c such that there is an algorithm for multiplying n\times n matrices running in time O(n^{c}); equivalently, for every \varepsilon>0 there is an algorithm running in time O(n^{\omega+\varepsilon}). Until the late 1960s it was believed that \omega=3, i.e. that no improvement can be found for the problem. In 1969, Strassen surprised everyone by showing that two n\times n matrices can be multiplied in O(n^{2.81}) time. This discovery spawned an extremely productive twenty-year period in which the upper bound on \omega was gradually lowered to 2.376. After a twenty-year stall, some very recent research has brought the upper bound down to 2.373.

Bilinear algorithms and recursion.
Strassen’s approach was to exploit the inherent recursive nature of matrix multiplication: the product of two kn\times kn matrices can be viewed as the product of two k\times k matrices, the entries of which are n\times n matrices. Suppose that we have an algorithm ALG that runs in o(k^3) time and multiplies two k\times k matrices. Then one can envision obtaining a fast recursive algorithm for multiplying k^i\times k^i matrices (for any integer i>1) as well: view the k^i\times k^i matrices as k\times k matrices the entries of which are k^{i-1}\times k^{i-1} matrices; then multiply the k\times k matrices using ALG and when ALG requires us to multiply two matrix entries, recurse.

This approach only works provided that the operations that ALG performs on the matrix entries make sense as matrix operations: e.g. entry multiplication, taking linear combinations of entries, etc. One very general type of such algorithm is the so-called bilinear algorithm: Given two matrices A and B, compute r products

P_l = (\sum_{i,j} u_{ijl} A[i,j])(\sum_{i,j} v_{ijl} B[i,j]),

i.e. take r possibly different linear combinations of entries of A and multiply each one with a possibly different linear combination of entries of B. Then, compute each entry of the product AB as a linear combination of the P_l: AB[i,j]=\sum_{l} w_{ijl} P_l.

Given a bilinear algorithm ALG for multiplying two k\times k matrices (for constant k) that computes r products P_l, the recursive approach that multiplies k^i\times k^i matrices using ALG gives a bound \omega\leq \log_k r. To see this, notice that the number of additions that one has to do is no more than 3rk^2: at most 2k^2 to compute the linear combinations for each P_l and at most r for each of the k^2 outputs AB[i,j]. Since matrix addition takes linear time in the matrix size, we have a recurrence of the form T(k^i) = r T(k^{i-1}) + O(rk^{2i}).

As long as r<k^3 we get a nontrivial bound on \omega. Strassen’s famous algorithm used k=2 and r=7 thus showing that \omega\leq \log_2 7. A lot of work went into getting better and better “base algorithms” for varying constants k. Methods such as Pan’s method of trilinear aggregation were developed. This approach culminated in Pan’s algorithm (1978) for multiplying 70\times 70 matrices that used 143,640 products and hence showed that \omega\leq \log_{70} 143,640<2.796.
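For concreteness, here is a short numpy sketch of the k=2, r=7 recursion (Strassen’s algorithm), assuming square matrices whose size is a power of two and falling back to the library routine below a cutoff; it is meant to illustrate the recursion, not to be tuned code.

    import numpy as np

    def strassen(A, B, cutoff=64):
        """Multiply square, power-of-two-sized matrices with the 7-product recursion."""
        n = A.shape[0]
        if n <= cutoff:
            return A @ B
        h = n // 2
        A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
        B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
        # Seven recursive products instead of the eight of the naive block method.
        M1 = strassen(A11 + A22, B11 + B22, cutoff)
        M2 = strassen(A21 + A22, B11, cutoff)
        M3 = strassen(A11, B12 - B22, cutoff)
        M4 = strassen(A22, B21 - B11, cutoff)
        M5 = strassen(A11 + A12, B22, cutoff)
        M6 = strassen(A21 - A11, B11 + B12, cutoff)
        M7 = strassen(A12 - A22, B21 + B22, cutoff)
        return np.block([[M1 + M4 - M5 + M7, M3 + M5],
                         [M2 + M4, M1 - M2 + M3 + M6]])

A quick sanity check is np.allclose(strassen(A, B), A @ B) for random 256\times 256 matrices; the recursion makes 7 calls on half-sized blocks plus O(n^2) additions, which is exactly the T(k^i) = r T(k^{i-1}) + O(rk^{2i}) recurrence above with k=2 and r=7.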

Approximate algorithms and Schonhage’s theorem.
A further step was to look at more general algorithms, so called approximate bilinear algorithms. In the definition of a bilinear algorithm the coefficients u_{ijl},v_{ijl},w_{ijl} were constants. In an approximate algorithm, these coefficients can be formal linear combinations of integer powers of an indeterminate, \lambda (e.g. 2\lambda^2-5\lambda^{-3}). The entries of the product AB are then only “approximately” computed, in the sense that AB[i,j]=\sum_{l} w_{ijl} P_l + O(\lambda), where the O(\lambda) term is a linear combination of positive powers of \lambda. The term “approximate” comes from the intuition that if you set \lambda to be close to 0, then the algorithm would get the product almost exactly.

Interestingly enough, Bini et al. (1980) showed that when dealing with the asymptotic complexity of matrix multiplication, approximate algorithms suffice for obtaining bounds on \omega. This is not obvious! What Bini et al. show, in a sense, is that as the size of the matrices grows, the “approximation” part can be replaced by a sort of bookkeeping which does not present an overhead asymptotically. The upshot is that if there is an approximate bilinear algorithm that computes r products P_l to compute the product of two k\times k matrices, then \omega\leq \log_k r.

Bini et al. (1979) gave the first approximate bilinear algorithm for a matrix product. Their algorithm used 10 entry products to multiply a 2\times 3 matrix with a 3\times 2 matrix. Although this algorithm is for rectangular matrices, it can easily be converted into one for square matrices: a 12\times 12 matrix is a 2\times 3 matrix with entries that are 3\times 2 matrices with entries that are 2\times 2 matrices, and so multiplying 12\times 12 matrices can be done recursively using Bini et al.’s algorithm three times, taking 1,000 entry products. Hence \omega\leq \log_{12} 1,000<2.78.
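The exponent bounds quoted so far are just logarithms of the (matrix size, product count) pairs, so they are easy to recompute:

    import math

    for name, k, r in [("Strassen", 2, 7),
                       ("Pan 1978", 70, 143640),
                       ("Bini et al., squared up to 12x12", 12, 1000)]:
        print(f"{name}: omega <= log_{k}({r}) = {math.log(r, k):.4f}")

This prints 2.8074, 2.7952 and 2.7799, matching the bounds above.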

Schonhage (1981) developed a sophisticated theory involving the bilinear complexity of rectangular matrix multiplication that showed that approximate bilinear algorithms are even more powerful. His paper culminated in something called the Schonhage \tau-theorem, or the asymptotic sum inequality. This theorem is one of the most useful tools in designing and analyzing matrix multiplication algorithms.

Schonhage’s \tau-theorem says roughly the following. Suppose we have several instances of matrix multiplication, each involving matrices of possibly different dimensions, and we are somehow able to design an approximate bilinear algorithm that solves all instances and uses fewer products than would be needed when computing each instance separately. Then this bilinear algorithm can be used to multiply (larger) square matrices and would imply a nontrivial bound on \omega.

What is interesting about Schonhage’s theorem is that it is believed that when it comes to exact bilinear algorithms, one cannot use fewer products to compute several instances than one would use by just computing each instance separately. This is known as Strassen’s additivity conjecture. Schonhage showed that the additivity conjecture is false for approximate bilinear algorithms. In particular, he showed that one can approximately compute the product of a 3\times 1 vector by a 1\times 3 vector and the product of a 1\times 4 vector by a 4\times 1 vector together using only 10 entry products, whereas any exact bilinear algorithm would need at least 3\cdot 3+4 = 13 products. His theorem then implied \omega<2.55, and this was a huge improvement over the previous bound of Bini et al.

Using fast solutions for problems that are not matrix multiplications.
The next realization was that there is no immediate reason why the “base algorithm” that we use for our recursion has to compute a matrix product at all. Let us focus on the following family of computational problems. We are given two vectors x and y and we want to compute a third vector z. The dependence of z on x and y is given by a three-dimensional tensor t as follows: z_k = \sum_{ij} t_{ijk} x_i y_j. The vector z is a bilinear form. The tensor t can be arbitrary, but let us focus on the case where t_{ijk}\in \{0,1\}. Notice that t completely determines the computational problem. Some examples of such bilinear problems are polynomial multiplication and of course matrix multiplication. For polynomial multiplication, t_{ijk} = 1 if and only if j=k-i, and for matrix multiplication, t_{(i,i'),(j,j'),(k,k')}=1 if and only if i'=j, j'=k and k'=i.

The nice thing about these bilinear problems is that one can easily extend the theory of bilinear algorithms to them. A bilinear algorithm computing a problem instance for tensor t computes r products P_l of the form (\sum_{i} u_{il} x_i)(\sum_j v_{jl} y_j) and then sets z_k=\sum_{l} w_{kl} P_l. Here, an algorithm is nontrivial if the number of products r that it computes is less than the number of positions t_{ijk} where the tensor t is nonzero.

In order to be able to talk about recursion for general bilinear problems, it is useful to define the tensor product t\otimes t' of two tensors t and t': (t\otimes t')_{(i,i'),(j,j'),(k,k')} = t_{ijk}\cdot t_{i'j'k'}. Thus, the bilinear problem defined by t\otimes t' can be viewed as a bilinear problem z_k = \sum_{ij} t_{ijk} x_i y_j defined by t, where each product x_iy_j is actually itself a bilinear problem z_{ijk'}=\sum_{i'j'} t'_{i'j'k'} x_{ii'}y_{jj'} defined by t'.

This allows one to compute an instance of the problem defined by t\otimes t' using an algorithm for t and an algorithm for t'. One can similarly define the k^{th} tensor power of a tensor t as tensor-multiplying t by itself k times. Then any bilinear algorithm computing an instance defined by t using r entry products can be used recursively to compute the k^{th} tensor power of t using r^k products, just as in the case of matrix multiplication.
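To make the index conventions concrete, here is a small numpy sketch; the row-major flattening of the paired indices is my own choice of encoding.

    import numpy as np

    def evaluate(t, x, y):
        """The bilinear problem defined by t:  z_k = sum_{i,j} t[i,j,k] x_i y_j."""
        return np.einsum('ijk,i,j->k', t, x, y)

    def tensor_product(t, tp):
        """(t (x) t')_{(i,i'),(j,j'),(k,k')} = t_{ijk} t'_{i'j'k'}, with each pair
        of indices flattened so that the result is again a 3-dimensional tensor."""
        out = np.einsum('ijk,abc->iajbkc', t, tp)
        i, a, j, b, k, c = out.shape
        return out.reshape(i * a, j * b, k * c)

    # Sanity check: the matrix multiplication tensor for n = 2, using the
    # convention above (t_{(i,i'),(j,j'),(k,k')} = 1 iff i'=j, j'=k and k'=i).
    n = 2
    t_mm = np.zeros((n * n,) * 3)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                t_mm[i * n + j, j * n + k, k * n + i] = 1
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    z = evaluate(t_mm, A.reshape(-1), B.reshape(-1))
    assert np.allclose(z.reshape(n, n), (A @ B).T)   # z holds the entries of AB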

A crucial development in the study of matrix multiplication algorithms was the discovery that sometimes algorithms for bilinear problems that do not look at all like matrix products can be converted into matrix multiplication algorithms. This was first shown by Strassen in the development of his “laser method” and was later exploited in the work of Coppersmith and Winograd (1987,1990). The basic idea of the approach is as follows.

Consider a bilinear problem P for which you have a really nice approximate algorithm ALG that uses r entry products. Take the n^{th} tensor power P^n of P (for large n), and use ALG recursively to compute P^n using r^n entry products. P^n is a bilinear problem that computes a long vector z from two long vectors x and y. Suppose that we can embed the product C of two N\times N matrices A and B into P^n as follows: we put each entry of A into some position of x and set all other positions of x to 0, we similarly put each entry of B into some position of y and set all other positions of y to 0, and finally we argue that each entry of the product C is in some position of the computed vector z (all other z entries are 0). Then we would have a bilinear algorithm for computing the product of two N\times N matrices using r^n entry products, and hence \omega\leq \log_N r^n.

The goal is to make N as large a function of n as possible, thus minimizing the upper bound on \omega.

Strassen’s laser method and Coppersmith and Winograd’s paper, and even Schonhage’s \tau-theorem, present ways of embedding a matrix product into a large tensor power of a different bilinear problem. The approaches differ in the starting algorithm and in the final matrix product embedding. We’ll give a very brief overview of the Coppersmith-Winograd algorithm.

The Coppersmith-Winograd algorithm.
The bilinear problem that Coppersmith and Winograd start with is as follows. Let q\geq 3 be an integer. Then we are given two vectors x and y of length q+2 and we want to compute a vector z of length q+2 defined as follows:

z_0 = \sum_{i=1}^q x_iy_i+x_0y_{q+1}+x_{q+1}y_0,

z_i=x_iy_0+x_0y_i for i\in \{1,\ldots, q\}, and z_{q+1}=x_0y_0.
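Spelled out as code, the starting bilinear problem is simply the following map; this just evaluates the defining formulas, and is emphatically not the clever q+2-product approximate algorithm itself.

    import numpy as np

    def cw_bilinear(x, y, q):
        """The Coppersmith-Winograd starting bilinear problem: x and y have
        length q+2, indexed 0..q+1; returns the output vector z of length q+2."""
        z = np.zeros(q + 2)
        z[0] = (sum(x[i] * y[i] for i in range(1, q + 1))
                + x[0] * y[q + 1] + x[q + 1] * y[0])
        for i in range(1, q + 1):
            z[i] = x[i] * y[0] + x[0] * y[i]
        z[q + 1] = x[0] * y[0]
        return z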

Notice that z is far from being a matrix product. However, it is related to 6 matrix products:

\sum_{i=1}^q x_iy_i which is the inner product of two q-length vectors,

x_0y_{q+1}, x_{q+1}y_0, and x_0y_0, which are three scalar products, and

the two matrix products computing x_iy_0 and x_0y_i for i\in \{1,\ldots, q\} which are both products of a vector with a scalar.

If we could somehow convert Coppersmith and Winograd’s bilinear problem into one of computing these 6 products as independent instances, then we would be able to use Schonhage’s \tau-theorem. Unfortunately, however, the 6 matrix products are merged in a strange way, and it is unclear whether one can get anything meaningful out of an algorithm that solves this bilinear problem.

Coppersmith and Winograd develop a multitude of techniques to show that when one takes a large tensor power of the starting bilinear problem, one can actually decouple these merged matrix products, and one can indeed apply the \tau-theorem. The \tau-theorem then gives the final embedding of a large matrix product into a tensor power of the original construction, and hence defines a matrix multiplication algorithm.

Their approach combines several impressive ingredients: sets avoiding 3-term arithmetic progressions, hashing and the probabilistic method. The algorithm computing their base bilinear problem is also impressive. The number of entry products it computes is q+2, which is exactly the length of the output vector z! That is, their starting algorithm is optimal for the particular problem that they are solving.

What is not optimal, however, is the analysis of how good a matrix product algorithm one can obtain from the base algorithm. Coppersmith and Winograd noticed this themselves: They first applied their analysis to the original bilinear algorithm and obtained an embedding of an f(n)\times f(n) matrix product into the 2n-tensor power of the bilinear problem for some explicit function f. (Then they took n to go to infinity and obtained \omega< 2.388.) Then they noticed that if one applies the analysis to the second tensor power of the original construction, then one obtains an embedding of an f'(n)\times f'(n) matrix product into the same 2n-tensor power, where f'(n)>f(n). That is, although one is considering embeddings into the same (2n) tensor power of the construction, the analysis crucially depends on which tensor power of the construction you start from! This led to the longstanding bound \omega<2.376. Coppersmith and Winograd left as an open problem what bound one can get if one starts from the third or larger tensor powers.

The recent improvements on \omega.
It seems that many researchers attempted to apply the analysis to the third tensor power of the construction, but this somehow did not improve the bound on \omega. Because of this, and since each new analysis required a lot of work, the approach was abandoned, at least until 2010. In 2010, Andrew Stothers carried through the analysis on the fourth tensor power and discovered that the bound on \omega can be improved to 2.374.

As mentioned earlier, the original Coppersmith-Winograd bilinear problem was related to 6 different matrix multiplication problems that were merged together. The k^{th} tensor power of the bilinear problem is similarly composed of poly(k) merged instances of simpler bilinear problems; however, these instances may no longer be matrix multiplications. When applying a Coppersmith-Winograd-like analysis to the k^{th} tensor power, there are two main steps.

The first step involves analyzing each of the poly(k) bilinear problems, intuitively in terms of how close they are to matrix products; there is a formal definition of the similarity measure called the value of the bilinear form. The second step defines a family of matrix product embeddings in the n^{th} tensor power in terms of the values. These embeddings are defined via some variables and constraints, and each represents some matrix multiplication algorithm. Finally, one solves a nonlinear optimization program to find the best among these embeddings, essentially finding the best matrix multiplication algorithm in the search space.

Both the Coppersmith-Winograd paper and Stothers’ thesis perform an entirely new analysis for each new tensor power k. The main goal of my work was to provide a general framework so that the two steps of the analysis do not have to be redone for each new tensor power k. My paper first shows that the first step, the analysis of each of the values, can be completely automated by solving linear programs and simple systems of linear equations. This means that instead of proving poly(k) theorems one only needs to solve poly(k) linear programs and linear systems, a simpler task. My paper then shows that the second step of the analysis, the theorem defining the search space of algorithms, can also be replaced by just solving a simple system of linear equations. (Amusingly, the fact that matrix multiplication algorithms can be used to solve linear equations implies that good matrix multiplication algorithms can be used to search for better matrix multiplication algorithms.) Together with the final nonlinear program, this presents a fully automated approach to performing a Coppersmith-Winograd-like analysis.

After seeing Stothers’ thesis in the summer of last year, I was impressed by a shortcut he had used in the analysis of the values of the fourth tensor power. This shortcut gave a way to use recursion in the analysis, and I was able to incorporate it in my analysis to show that the number of linear programs and linear systems one would need to solve to compute the values for the k^{th} tensor power drops down to O(k^2), at least when k is a power of 2. This drop in complexity allowed me to analyze the 8^{th} tensor power, thus obtaining an improvement in the bound of \omega: \omega<2.3727.

There are several lingering open questions. The most natural one is, how does the bound on \omega change when applying the analysis to higher and higher tensor powers. I am currently working together with a Stanford undergraduate on this problem: we’ll apply the automated approach to several consecutive powers, hoping to uncover a pattern so that one can then mathematically analyze what bounds on \omega can be proven with this approach.

A second open question is more far reaching: the Coppersmith-Winograd analysis is not optimal – in a sense it computes an approximation to the best embedding of a matrix product in the n^{th} tensor power of their bilinear problem. What is the optimal embedding? Can one analyze it mathematically? Can one automate the search for it?

Finally, I am fascinated by automating the search for algorithms for problems. In the special case of matrix multiplication we were able to define a search space of algorithms and then use software to optimize over this search space. What other computational problems can one approach in this way?

Several Comments from Manuel Blum

September 13, 2012 by

Dear Hunter — What a nice idea. And many of your examples below are both new to me and quite interesting. Thanks! And good luck! — manuel

While thinking about your blog, two things occurred to me that might be worth mentioning:

1. If a function f(x) has speedup, then any lower bound on its computation can be improved by a corresponding amount. For example, if every program for computing f(x) can be sped up to run twice as fast (on all but a finite number of integers), then any lower bound G(x) on its run time can be raised from G(x) to 2G(x) (on all but a finite number of integers). For another example, if any program for computing f(x) can be sped up by a sqrt (so that any run time F(x) can be reduced to a runtime of at most sqrt(F(x))), then any lower bound G(x) on its run time can be raised to [G(x)]^2, etc. This is all easy to see.

2. Much harder to see is a curious relation between speedup and inductive inference, which has to do with inferring an algorithm from observation of the sequence of integers that it generates. Theorem: there exists an inductive inference algorithm for inferring all sequences that have optimal algorithms (i.e. have programs that cannot be sped up)! This was quite a surprise (and a breakthrough) for me. Still is. To explain it though, I’d have to explain inductive inference, etc, and this would take me a bit of time. Some day…

Anyway, thanks again for your blog.

Best wishes and warm regards,

manuel (blum)

Combinatorial problems related to matrix multiplication (guest post by Chris Umans)

September 10, 2012 by

Determining the exponent of matrix multiplication, \omega, is one of the most prominent open problems in algebraic complexity. Strassen famously showed that \omega < 2.81 in 1969, and after a sequence of works culminating in work of Stothers and Williams the best current upper bound is \omega < 2.3727. It is believed that indeed \omega = 2, meaning that there is a family of algorithms with running time O(n^{2 + \epsilon}) for every \epsilon > 0. In this sense, if \omega =2, matrix multiplication exhibits “speedup”: there is no single best asymptotic algorithm, only a sequence of algorithms having running times O(n^a), with a approaching 2. In fact Coppersmith and Winograd showed in 1982 that this phenomenon occurs whatever the true value of \omega is (two, or larger than two), a result that is often summarized by saying “the exponent of matrix multiplication is an infimum, not a minimum.”

Despite the algebraic nature of the matrix multiplication problem, many of the suggested routes to proving \omega = 2 are combinatorial. This post is about connections between one such combinatorial conjecture (the “Strong Uniquely Solvable Puzzle” conjecture of Cohn, Kleinberg, Szegedy and Umans) and some more well-known combinatorial conjectures. These results appear in this recent paper with Noga Alon and Amir Shpilka.

The cap set conjecture, a strengthening, and a weakening

We start with a well-known, and stubborn, combinatorial problem, the “cap-set problem.” A cap-set in Z_3^n is a subset of vectors in Z_3^n with the property that no three vectors (not all the same) sum to 0. The best lower bound is 2.21...^n due to Edel. Let us denote by the “cap set conjecture” the assertion that one can actually find sets of size (3 - o(1))^n. It appears that there is no strong consensus on whether this conjecture is true or false, but one thing that we do know is that it is a popular blog topic (see here, here, here, and the end of this post).

Now it happens that the only triples of elements of Z_3 that sum to 0 are \{0,0,0\}, \{1,1,1\}, \{2,2,2\} and \{0,1,2\} — triples that are “all the same, or all different.” So the cap set conjecture can be rephrased as the conjecture that there are large subsets of vectors in Z_3^n so that for any x, y, z in the set (not all equal), there is some coordinate i such that (x_i, y_i, z_i) are “not all the same, and not all different.” Generalizing from this interpretation, we arrive at a family of much stronger assertions, one for each D: let’s denote by the “Z_D^n conjecture” the assertion that there are subsets of vectors in Z_D^n, of size (D - o(1))^n, with the property that for any x, y, z in the set (not all equal), there is some coordinate i such that (x_i, y_i, z_i) are “not all the same, and not all different.” Sets of size (D - o(1))^n in Z_D^n yield sets of this type in Z_3^{n \log_3 D} of size (3 - o(1))^{n \log_3 D} by viewing each symbol in Z_D as a block of \log_3 D symbols in Z_3, so the Z_D^n conjecture is stronger (i.e. it implies the cap-set conjecture). Indeed, if you play around with constructions you can see that as D gets larger it seems harder and harder to have large sets avoiding triples of vectors for which every coordinate i has (x_i, y_i, z_i) “all different.” Thus one would probably guess that the Z_D^n conjecture is false.
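A tiny brute-force checker may help pin the definitions down; the encoding of vectors as tuples over {0, ..., D-1} is mine.

    from itertools import combinations_with_replacement

    def coord_ok(a, b, c):
        """True iff (a, b, c) is 'not all the same and not all different'."""
        return len({a, b, c}) == 2

    def is_cap_set(vectors):
        """Check the (rephrased) cap-set condition: every triple of vectors,
        not all equal, has some coordinate that is neither all the same nor
        all different (over Z_3 this says the triple does not sum to 0)."""
        return all(x == y == z or any(coord_ok(a, b, c) for a, b, c in zip(x, y, z))
                   for x, y, z in combinations_with_replacement(vectors, 3))

For instance, the 2^n vectors with all entries in {0,1} always pass, since a coordinate holding three 0/1 entries sums to a multiple of 3 only when the entries are equal. Applied to tuples over Z_D, the very same check is the condition of the Z_D^n conjecture.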

One of the results in our paper is that the Z_D^n conjecture is in fact *equivalent* to falsifying the following well-known sunflower conjecture of Erdos-Szemeredi: there exists an \epsilon > 0 such that any family of at least 2^{(1 - \epsilon)n} subsets of [n] contains a 3-sunflower (i.e., three sets whose pairwise intersections are all equal).

So the intuition that the Z_D^n conjecture is false agrees with Erdos and Szemeredi’s intuition, which is a good sign.

Now let’s return to the cap-set conjecture in Z_3^n. Being a cap set is a condition on *all* triples of vectors in the set. If one restricts to a condition on only some of the triples of vectors, then a construction becomes potentially easier. We will be interested in a “multicolored” version in which vectors in the set are assigned one of three colors (say red, blue, green), and we only require that no triple of vectors x,y,z sums to 0 with x being red, y being blue, and z being green. But a moment’s thought reveals that the problem has become too easy: one can take (say) the red vectors to be all vectors beginning with 01, the green vectors to be all vectors beginning with 00, and the blue vectors to be all vectors beginning with 1. A solution that seems to recover the original flavor of the problem is to insist that the vectors come from a collection of red, green, blue triples (x, y, z) with x+y+z = 0; we then require that every red, green, blue triple *except those in the original collection* not sum to 0. So, let’s denote by the “multicolored cap-set conjecture” the assertion that there are subsets of *triples of vectors* from Z_3^n of size (3 - o(1))^n, with each triple summing to 0, and with the property that for any three triples (x_1, y_1, z_1), (x_2, y_2, z_2), (x_3, y_3, z_3) in the set (not all equal), x_1+y_2+z_3 \neq 0.

If S is a cap set in Z_3^n, then the collection of triples \{(x,x,x): x \in S\} constitutes a multicolored cap-set of the same size, so the multicolored version of the conjecture is indeed weaker (i.e. it is implied by the cap-set conjecture).

A matrix multiplication conjecture between the two

The SUSP conjecture of Cohn, Kleinberg, Szegedy, and Umans is the following: there exist subsets of *triples of vectors* from \{0,1\}^n of size (27/4 - o(1))^{n/3}, with each triple summing (in the integers) to the all-ones vector, and with the property that for any three triples [(x_1, y_1,z_1), (x_2, y_2, z_2), (x_3, y_3, z_3)] in the set (not all equal), there is a coordinate i such that x_1[i]+y_2[i]+z_3[i] = 2.

The mysterious constant (27/4)^{1/3} comes from the fact that {n \choose {n/3}} \approx (27/4)^{n/3}, and it is easy to see that one cannot have multiple triples with the same “x vector” (or y, or z…). More specifically, “most” triples (x, y, z) that sum to the all-ones vector are balanced (meaning that x, y, z each have weight n/3), and there can be at most {n \choose {n/3}} balanced triples, without repeating an x vector. So the conjecture is that there are actually subsets satisfying the requirements, whose sizes approach this easy upper bound.
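The condition is easy to state as a brute-force check; the encoding of a puzzle as a collection of (x, y, z) tuples of 0/1 tuples is mine, and replacing == 2 by != 1 gives the weaker USP condition discussed next.

    from itertools import product

    def is_strong_usp(triples, n):
        """Brute-force check of the SUSP condition for triples of {0,1}^n vectors."""
        # Each triple must sum to the all-ones vector over the integers.
        if not all(all(x[i] + y[i] + z[i] == 1 for i in range(n))
                   for x, y, z in triples):
            return False
        # For every ordered choice of three triples, not all equal, some
        # coordinate i must have x_1[i] + y_2[i] + z_3[i] = 2.
        return all(t1 == t2 == t3 or
                   any(t1[0][i] + t2[1][i] + t3[2][i] == 2 for i in range(n))
                   for t1, t2, t3 in product(triples, repeat=3))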

If, in the statement of the SUSP conjecture, one replaces “there is a coordinate i such that x_1[i]+y_2[i]+z_3[i] = 2” with “there is a coordinate i such that x_1[i]+y_2[i]+z_3[i] \neq 1”, one gets the weaker “Uniquely Solvable Puzzle” conjecture instead of the “Strong Uniquely Solvable Puzzle” conjecture. Here one is supposed to interpret the (x,y,z) triples as triples of “puzzle pieces” that fit together (i.e. their 1’s exactly cover the n coordinates without overlap); the main requirement then ensures that no other way of “assembling the puzzle” fits together in this way, thus it is “uniquely solvable.” It is a consequence of the famous Coppersmith-Winograd paper that the Uniquely Solvable Puzzle conjecture is indeed true. Cohn, Kleinberg, Szegedy and Umans showed that if the stronger SUSP conjecture is true, then \omega = 2.

Two of the main results in our paper are that (1) the Z_D^n conjecture implies the SUSP conjecture, and (2) the SUSP conjecture implies the multicolored cap-set conjecture.

So by (1), *disproving the SUSP* conjecture is as hard as proving the Erdos-Szemeredi conjecture (which recall is equivalent to disproving the Z_D^n conjecture). Of course we hope the SUSP conjecture is true, but if it is not, it appears that this will be difficult to prove.

And by (2), proving the SUSP conjecture entails proving the multicolored cap-set conjecture. So apart from being a meaningful weakening of the famous cap-set conjecture in Z_3^n, the multicolored cap-set conjecture can be seen as a “warm-up” to (hopefully) proving the SUSP conjecture and establishing \omega = 2. As a start, we show in our paper a lower bound of 2.51^n for multicolored cap-sets, which beats the 2.21^n lower bound of Edel for ordinary cap sets.

Returning to “speedup,” the topic of this blog, notice that the SUSP conjecture, as well as the cap-set, the multicolored cap-set, and the Z_D^n conjectures, all assert that there exist sets of size (c-o(1))^n with certain properties, for various constants c. In all cases c^n is easily ruled out, and so all one can hope for is a sequence of sets, one for each n, whose sizes approach c^n as n grows. If the SUSP conjecture is true, then it is this sequence of sets that directly corresponds to a sequence of matrix multiplication algorithms with running times O(n^a) for a approaching 2. This would then be a concrete manifestation of the “speedup” phenomenon for matrix multiplication.

Speedup in Computational Complexity

September 10, 2012 by

Welcome to this new blog aimed at promoting discussion of speedup for natural computational problems, that is, the question in computational complexity of whether a problem has a best algorithm. The intention is to have a series of guest posts by researchers in the field and to provide a shared, user maintained bibliography of work on the topic.

I believe this blog is necessary because the topic is interesting and underappreciated. It is a fundamental property of a computational problem as to whether it does or does not have an optimal algorithm. While a few classic results on speedup versus optimality are widely known, such as Blum’s speedup for artificially constructed languages and Levin’s optimal NP search algorithm, other key results on speedup for natural problems are not familiar. For instance, there is no best Strassen-style bilinear identity for matrix multiplication. The existence of efficient algorithms for MM is linked to key problems in combinatorics such as the Erdos-Rado sunflower conjecture (see Alon et al) and efforts to lower the MM exponent have recently resumed. Curiously, four of the five problems with a gap between their monotone and nonmonotone circuit complexity rely upon integer or matrix multiplication, which are conjectured to have speedup. As another example, Stockmeyer’s PhD thesis presented a result on speedup for natural problems that had been virtually forgotten. It has also been established that if there is an infinitely often (i.o.) superpolynomial speedup for the time required to accept any (paddable) coNP-complete language, then all (paddable) coNP-complete languages have this speedup, which is equivalent to a superpolynomial speedup in proof length in propositional proof systems for tautologies.

I have argued that there is a correspondence between the properties “has no algorithm at all” and “has no best algorithm” by demonstrating that a resource-bounded version of the statement “no algorithm recognizes all non-halting Turing machines” is equivalent to an infinitely often (i.o.) superpolynomial speedup for the time required to accept any (paddable) coNP-complete language. A literature review on speedup can be found here.

Please feel free to contribute by commenting on blog posts or adding papers to the shared bibliography and also let me know if you would like to make a guest post on a topic of general interest or on your latest research.

Hunter Monroe

