One story every computing enthusiast should hear is how loops and tail recursion are equivalent. We like recursive functions because they're amenable to induction, and because we can derive them in direct correspondence with the definition of the datatype over which they recur. We like loops because they're fast and make intuitive sense, as long as variables don't change in too tricky a way.

In general, recursive functions are slower than loops because they push stack frames. The performance of most programs today is dominated by memory reads and writes, and the data we touch the most lives in the cache; we do not want to evict a ton of it under any circumstance. In a direct-style implementation of a recursive function, the recursive call has to push a stack frame to remember what to do once the function returns:

;; Racket
(define (sum l)
  (if (empty? l)
      0
      (+ (first l) (sum (rest l)))))

// C
int sum(const int *l, int length) {
    if (length == 0)
        return 0;
    else
        return l[0] + sum(l + 1, length - 1);
}

When we get to (+ (first l) (sum (rest l))), we first call (first l), which returns the first element. Next we have to evaluate (sum (rest l)): to be fully precise, we call (rest l), take its result x, and call (sum x), remembering to come back, take that result, and add it to the result of (first l). This is why we need the stack frame: the partial result (here, the value of (first l)) has to be stored somewhere while the recursive call runs. Each recursive call saves its own copy of that partial result, so we need O(n) stack space for a list of size n.
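To make that bookkeeping explicit, here is the same function with every intermediate step named (a sketch; the names a, r, and s are only for illustration):

;; Racket -- same function, evaluation order spelled out
(define (sum l)
  (if (empty? l)
      0
      (let* ([a (first l)]  ;; partial result that must survive...
             [r (rest l)]
             [s (sum r)])   ;; ...across this recursive call
        (+ a s))))          ;; work remaining after the call returns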

Of course, if we use iteration this all goes away:

;; Racket
(define (sum l)
  (define x 0)
  (for ([elt l])
    (set! x (+ x elt)))
  x)

// C
int sum(const int *l, int length) {
    int x = 0;
    for (int i = 0; i < length; i++) {
        x += l[i];
    }
    return x;
}

We all have an intuitive sense of what the loop is doing: once we hit the end of the loop body, we do not make a recursive call (we never issue a call instruction in assembly); we simply jump back to the beginning of the loop. The key is that x is being used as an accumulator, growing a partial result in bottom-up fashion as the computation proceeds and yielding the final value at the end. Instead of keeping partial results on the stack, the loop runs in constant space and linear time.

In a tail-recursive implementation, the rule is that every recursive call must be a tail call. Intuitively, a tail call is a call which is “immediately returned.” More formally, a subexpression is in tail position when its value becomes the value of the whole expression. For example, in (if guard e-t e-f), both e-t and e-f are in tail position, but the guard is not: after we decide which branch to take, we’re committed:

(define (foo ...)
  ...
  (if guard
    (f x y ...)
    (g z ...)))

Once we finish executing guard, it would be useless (but correct) to (a) push a stack frame, (b) wait on the result of the subordinate call, and (c) merely return that same result: all we would be doing is copying the return value from the callee and propagating it back as the return value of the caller. Being a tail call is a syntactic property of a callsite: we (and the compiler) can look at a piece of code and cheaply decide whether or not a call is a tail call.
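For contrast, here is a call that is not in tail position (a sketch, assuming some function f):

;; Racket -- the call (f x) is NOT a tail call: add1 still has work
;; to do with its result, so bar's stack frame must survive the call
(define (bar x)
  (add1 (f x)))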

The reasoning above generalizes to any call expression in tail position: because the whole expression evaluates to the tail call’s result, administratively copying that result up and down the stack is extensionally a no-op. Now, the tail-recursive version uses a simple trick I teach to all of my students: (a) identify an accumulator variable, (b) instead of computing with the result of the recursive call, compute with the current accumulator, and (c) return the accumulator in the base case:

;; Racket
(define (sum l acc) ;; note: acc got added
  (if (empty? l)
      acc ;; this is the *true* return!
      (sum (rest l) (+ acc (first l)))))

// C -- we pass in length manually because we're using arrays
int sum(const int* l, int length, int acc) {
    if (length == 0) return acc;
    return sum(l + 1, length - 1, acc + l[0]);
}

Both of these functions are tail recursive, because the only recursive call to sum is also the return value from sum (or, more directly: because both calls to sum are in tail position). Since the compiler knows these are tail calls, a compiler with tail-call optimization will ensure that both of them compile into jmp instructions, with zero impact on stack usage, rather than the more burdensome (on the cache, the stack, etc.) direct-style calls. Something that should concern you is this: if the function is using constant stack space, how are the variables being updated and represented!? The answer is that the arguments get stomped over, mutably updated in place, yielding the exact same performance profile as a loop!
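To see the stomping concretely, here is a sketch of what a tail-call-optimizing compiler effectively produces for the C version, written with an explicit goto instead of assembly:

// C -- a sketch of the post-optimization control flow
int sum(const int *l, int length, int acc) {
top:
    if (length == 0) return acc;
    acc += l[0];   // stomp over the arguments...
    l += 1;
    length -= 1;
    goto top;      // ...and jump, instead of pushing a frame
}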

Now it's time for an exercise: can you convert this program to use tail recursion?

;; return a pair (cons cell) of the number of even numbers,
;; and the number of odd numbers.
;; HINT: use multiple accumulators. 
(define (even-odd l)
  (if (empty? l)
      (cons 0 0)
      (let ([v (even-odd (rest l))])
        (if (even? (first l))
            (cons (add1 (car v)) (cdr v))
            (cons (car v) (add1 (cdr v)))))))
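If you want to check your answer, here is one possible solution following the hint (a sketch; the accumulator names evens and odds are mine):

;; Racket -- a sketch with two accumulators; call as (even-odd l 0 0)
(define (even-odd l evens odds)
  (cond [(empty? l) (cons evens odds)]
        [(even? (first l)) (even-odd (rest l) (add1 evens) odds)]
        [else (even-odd (rest l) evens (add1 odds))]))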

What about this program?

;; flattens a tree? into a list
;; It's hard because there are *two* calls to linearize--can you do anything?
(define (linearize t)
  (match t
    ['empty '()]
    [`(node ,v ,t0 ,t1) (append (list v) (linearize t0) (linearize t1))]))
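A sketch of a partial answer (rev-lin is a helper name I am introducing): building the result in reverse lets the outer call become a tail call, but the inner call still pushes a frame, and that residual call is exactly what CPS, below, eliminates.

;; Racket -- one of the two calls becomes a tail call
(define (rev-lin t acc)
  (match t
    ['empty acc]
    [`(node ,v ,t0 ,t1)
     ;; outer call: tail position; inner call: still a real call
     (rev-lin t1 (rev-lin t0 (cons v acc)))]))

(define (linearize t) (reverse (rev-lin t '())))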

One surprising fact is that we can systematically compile any program so that every call is a tail call, by completely transforming the program into continuation-passing style (CPS); this essentially eliminates the stack. Indeed, some compilers for functional languages work precisely this way: those languages cannot fall prey to a stack overflow, because they have traded a monolithic (efficient, array-like) stack for a deeply-nested one strewn throughout the heap, since the continuations are heap-allocated and nested. There are various exciting trade-offs here, but for now I will leave this as is: we will continue next week.
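As a taste, here is sum in CPS (a sketch; k is the continuation, a function standing in for “the rest of the computation”):

;; Racket -- every call below is a tail call; the "stack frame"
;; has become a heap-allocated closure
(define (sum-k l k)
  (if (empty? l)
      (k 0)
      (sum-k (rest l)
             (lambda (r) (k (+ (first l) r))))))

;; (sum-k '(1 2 3) (lambda (x) x)) ;; => 6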