notes

Implementing efficient guards

When accessing a value from the stack, the implementation needs to check for underflow. In an interpreter, this is done while executing the stack-accessing instruction. A compiler can do better, though.

For example, this program has three instructions that may underflow, without knowing anything about the size of the stack before l1:

l1:
    div         # access 2
    mul         # access 3
    printi
    dup         # access 4
    printi
    jmp l2

It consists of a single basic block, that would be lowered via the Nebula strategy of registerizing stack operations in a basic block to something like this:

l1:
    guard underflow(2), source(div, line 2)
    %0 = load_stack 0
    %1 = load_stack 1
    %2 = div_lazy %0, %1
    guard underflow(3), source(mul, line 3)
    %3 = load_stack 2
    %4 = mul_lazy %3, %2
    eval %4
    printi %4
    guard underflow(4), source(dup, line 5)
    %5 = load_stack 3
    eval %5
    printi %5
    drop 3
    jmp l2

When the stack has a size less than 2, then the predicate underflow(2) is true and guard underflow(2), source(div, line 2) produces an error, identifying the div on line 2 as the originating instruction. guard can also take a label as the second parameter, in which case it jumps there when the predicate is true.

This program currently checks the stack size for each operation that accesses lower on the stack. For this small program, it’s only three guards, but it would be many more for larger programs. None of the guard operations can be simply combined, though, or the effects would be out of order.

If we assume that nearly all executions never underflow and that an underflow would occur later in execution, we can optimize for the branch that won’t underflow. First, duplicate the basic block and replace the largest underflow guard with an error, making it the terminator. Then, in the original basic block, move that corresponding guard to the top and eliminate the other underflow guards.

l1:
    guard underflow(4), .l1__underflow
    %0 = load_stack 0
    %1 = load_stack 1
    %2 = div_lazy %0, %1
    %3 = load_stack 2
    %4 = mul_lazy %3, %2
    eval %4
    printi %4
    %5 = load_stack 3
    eval %5
    printi %5
    drop 3
    jmp l2

.l1__underflow:
    guard underflow(2), source(div, line 2)
    %0 = load_stack 0
    %1 = load_stack 1
    %2 = div_lazy %0, %1
    guard underflow(3), source(mul, line 3)
    %3 = load_stack 2
    %4 = mul_lazy %3, %2
    eval %4
    printi %4
    error underflow(4), source(dup, line 5)

In the worst case, when few underflow guards can be removed by analysis, this doubles the program size, but it reduces the number of underflow checks from one per stack-accessing instructions to one per basic block. When an execution underflows, it will still check for underflow on any lower access, but only for that final basic block. As it is not possible for it to loop at this point, it is generally a small overhead.

After a strictness analysis, the order of evaluation is made explicit and guards are inserted for zero divisors:

l1:
    guard underflow(4), .l1__underflow
    %0 = load_stack 2
    eval %0
    %1 = load_stack 1
    eval %1
    guard zero(%1), source(div, line 2)
    %2 = load_stack 0
    eval %2
    %3 = div %2, %1
    %4 = mul %0, %3
    printi %4
    %5 = load_stack 3
    eval %5
    printi %5
    drop 3
    jmp l2

.l1__underflow:
    guard underflow(2), source(div, line 2)
    guard underflow(3), source(mul, line 3)
    %0 = load_stack 2
    eval %0
    %1 = load_stack 1
    eval %1
    guard zero(%1), source(div, line 2)
    %2 = load_stack 0
    eval %2
    %3 = div %2, %1
    %4 = mul %0, %3
    printi %4
    error underflow(4), source(dup, line 5)

Since the analyzer has the full program available, in most cases, the stack can be statically determined to not underflow, so most stack guards will be eliminated.