
ægraphs: Acyclic E-graphs for Efficient Optimization in a Production Compiler

Notes on “ægraphs: Acyclic E-graphs for Efficient Optimization in a Production Compiler”, presented by Chris Fallin at the EGRAPHS workshop at PLDI 2023.

[video] [conference]

E-graph per basic block

The naïve approach is to build one e-graph per basic block, but that limits the scope of rewrites, loses sharing/amortization across blocks, and rules out control-flow optimizations.

E-graph sea of nodes

Use one e-graph for the whole program or one per function.

Represent control flow as structured nodes, like an if node, in a nested tree of regions. Like Graal’s structured loops or Relooper.
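A minimal sketch of what such a nested region tree might look like; the shape is my own guess, not something from the talk:

```rust
// Hypothetical sketch (my own, not from the talk) of structured control
// flow as nested regions, in the spirit of Graal or Relooper.

type NodeId = u32;

enum Region {
    /// Straight-line sequence of e-graph node IDs.
    Block(Vec<NodeId>),
    /// Structured `if`: a condition node plus two nested sub-regions.
    If { cond: NodeId, then_region: Box<Region>, else_region: Box<Region> },
    /// Structured loop around a body region.
    Loop { body: Box<Region> },
    /// Regions executed in sequence.
    Seq(Vec<Region>),
}

fn main() {
    // if node 0 { node 1 } else { node 2 }, then node 3.
    let _program = Region::Seq(vec![
        Region::If {
            cond: 0,
            then_region: Box::new(Region::Block(vec![1])),
            else_region: Box::new(Region::Block(vec![2])),
        },
        Region::Block(vec![3]),
    ]);
    let _loop = Region::Loop { body: Box::new(Region::Block(vec![])) };
}
```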

Memory stores “somehow” need edges in the e-graph. ~ Sounds like what sea of nodes already does!

Benefits:

Downsides:

If writing the compiler from scratch, he’d be more interested in taking a regions-based approach.

Jamey Sharp’s optir prototype uses this approach, combining Regionalized Value State Dependence Graphs (RVSDGs) with e-graphs.

My reflection

Since my IR is a sea of nodes in an arena, I already have the building blocks for a whole-program e-graph representation. I think I could avoid a two-way conversion between e-graphs and a CFG. NodeTable already hashes my nodes for unique insertion, so I could rework it into a hashcons structure to power the e-graph.
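A rough sketch of the hashcons rework I have in mind (names and shapes are hypothetical): nodes keyed by operator plus operand IDs, with duplicate insertions returning the existing ID.

```rust
use std::collections::HashMap;

// Rough sketch (hypothetical names): a hashcons table keyed by operator
// plus operand IDs, so duplicate insertions return the existing node ID.
#[derive(Clone, PartialEq, Eq, Hash)]
struct NodeKey {
    op: &'static str,
    args: Vec<u32>, // operand node/e-class IDs
}

#[derive(Default)]
struct HashCons {
    table: HashMap<NodeKey, u32>,
    next_id: u32,
}

impl HashCons {
    fn intern(&mut self, op: &'static str, args: Vec<u32>) -> u32 {
        let key = NodeKey { op, args };
        if let Some(&id) = self.table.get(&key) {
            return id; // structurally identical node already exists
        }
        let id = self.next_id;
        self.next_id += 1;
        self.table.insert(key, id);
        id
    }
}

fn main() {
    let mut hc = HashCons::default();
    let one = hc.intern("iconst.1", vec![]);
    let two = hc.intern("iconst.2", vec![]);
    let add = hc.intern("iadd", vec![one, two]);
    assert_eq!(add, hc.intern("iadd", vec![one, two])); // deduplicated
}
```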

Perhaps I could make it work without structuring control flow. I’ve written several irreducible programs that I’d want to compile, so that would be valuable. If the terminators are also nodes, their dependence edges could be used. Cyclic edges might then be problematic… hinted at by the name “acyclic e-graphs”.

I think this approach would make optimizing across basic blocks easier. It could recognize something like if x < y { x } else { y }, even when unstructured, and rewrite it to min(x, y).
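A toy version of that rewrite, over hypothetical node shapes of my own rather than anything from Cranelift:

```rust
// Toy sketch (my own, hypothetical node shapes): rewrite
//   select(lt(x, y), x, y)  ==>  min(x, y)
// at node-creation time.

#[derive(Clone, Debug, PartialEq)]
enum Node {
    Var(&'static str),
    Lt(Box<Node>, Box<Node>),
    Select(Box<Node>, Box<Node>, Box<Node>),
    Min(Box<Node>, Box<Node>),
}

fn simplify(n: Node) -> Node {
    if let Node::Select(cond, t, f) = &n {
        if let Node::Lt(a, b) = cond.as_ref() {
            // select(a < b, a, b) picks the smaller of the two: min(a, b).
            if a == t && b == f {
                return Node::Min(t.clone(), f.clone());
            }
        }
    }
    n
}

fn main() {
    let x = || Box::new(Node::Var("x"));
    let y = || Box::new(Node::Var("y"));
    let n = Node::Select(Box::new(Node::Lt(x(), y())), x(), y());
    assert_eq!(simplify(n), Node::Min(x(), y()));
}
```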

With the load-store edges explicit, it may be able to subsume redundant load elimination and store-to-load forwarding. Once the address values have been optimized by the e-graph, it could determine more precisely which addresses alias.
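A sketch of the store-to-load-forwarding half of that idea, under the (hypothetical) assumption that addresses are already canonicalized to e-class IDs, so equal IDs mean provably equal addresses:

```rust
use std::collections::HashMap;

// Hypothetical sketch: once addresses are canonicalized to e-class IDs,
// a store followed by a load from the same ID (with no intervening store
// that may alias) can forward the stored value directly.
type EClass = u32;

enum MemOp {
    Store { addr: EClass, value: EClass },
    Load { addr: EClass, result: EClass },
}

/// Returns a map from load-result e-class to the forwarded value e-class.
/// Extremely conservative: any store is treated as possibly aliasing
/// everything else and clears the known values.
fn forward(ops: &[MemOp]) -> HashMap<EClass, EClass> {
    let mut known: HashMap<EClass, EClass> = HashMap::new();
    let mut forwarded = HashMap::new();
    for op in ops {
        match op {
            MemOp::Store { addr, value } => {
                known.clear(); // unknown aliasing with other addresses
                known.insert(*addr, *value);
            }
            MemOp::Load { addr, result } => {
                if let Some(v) = known.get(addr) {
                    forwarded.insert(*result, *v);
                }
            }
        }
    }
    forwarded
}

fn main() {
    let ops = [
        MemOp::Store { addr: 1, value: 7 },
        MemOp::Load { addr: 1, result: 9 },
    ];
    assert_eq!(forward(&ops).get(&9), Some(&7)); // the load reuses value 7
}
```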

E-graph with CFG skeleton

Uses an e-graph scope for the whole function body, but keeps the CFG around. Effectful operations, like memory stores, remain in the CFG and are ordered. CFG nodes point into the e-graph, which contains the pure nodes. This sounds very close to my current approach.
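A minimal sketch of how I picture that split, with names of my own: ordered effectful instructions live in blocks and refer into the e-graph by e-class ID, while the e-graph holds only pure nodes.

```rust
// Hypothetical sketch of the "CFG skeleton + e-graph" split. Names are my
// own, not Cranelift's.

type EClass = u32;

enum PureNode {
    Const(i64),
    Add(EClass, EClass),
    // ... other pure operators
}

enum SkeletonInst {
    Store { addr: EClass, value: EClass },
    Call { func: u32, args: Vec<EClass>, result: Option<EClass> },
    Branch { cond: EClass, then_block: u32, else_block: u32 },
}

struct Block {
    insts: Vec<SkeletonInst>, // ordered: side effects keep their sequence
}

struct Function {
    blocks: Vec<Block>,    // the CFG skeleton
    egraph: Vec<PureNode>, // pure nodes, indexed by e-class ID
}

fn main() {
    let f = Function {
        blocks: vec![Block {
            insts: vec![SkeletonInst::Store { addr: 0, value: 2 }],
        }],
        egraph: vec![PureNode::Const(1), PureNode::Const(2), PureNode::Add(0, 1)],
    };
    assert_eq!(f.blocks.len(), 1);
    assert_eq!(f.egraph.len(), 3);
}
```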

CFG skeleton contains:

e-graph contains:

Benefits:

Downsides:

Elaboration

To lower the e-graph with CFG skeleton to a CFG (“elaboration”): …

His convention is to identify e-classes as ec0, ec1, …, ecn, and elaborated SSA values as v0, v1, …, vn.

Moving an expression that is used on several, but not all, branches to an earlier point, so that all branches can share it, creates a “partial redundancy”: the branches that never needed it now compute it anyway.
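A tiny illustration of my own (not from the talk):

```rust
// Partial redundancy example: `a + b` is needed on two of the three arms.
// Hoisting it above the `match` lets those arms share one computation, but
// the third arm now pays for work it never needed.
fn f(sel: u8, a: u32, b: u32) -> u32 {
    let t = a + b; // hoisted above the branch
    match sel {
        0 => t,
        1 => t.wrapping_mul(2),
        _ => a, // this arm never needed `t`
    }
}

fn main() {
    assert_eq!(f(0, 2, 3), 5);
    assert_eq!(f(9, 2, 3), 2);
}
```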

Dominance

A dominates B if every path from the entry to B passes through A. The immediate dominator of a block is its closest dominator. Dominance forms a tree.

SSA requires the invariant that a value’s definition dominates its uses.

GVN: if an operation dominates a duplicate copy of itself, reuse the original. Implemented with a dominator-tree pre-order traversal and a scoped map (pushing a scope on pre-order entry, then popping the scope and its changes on post-order exit). The map lets a computation be looked up to find the SSA value that already defines it, so it can be reused.

What data structure is used for the scoped hash map? A persistent hash map like Rust im::HashMap?
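One plausible answer, sketched below (my guess, not necessarily what Cranelift actually uses): a single hash map plus an undo log, where entering a scope records a mark and leaving it removes everything inserted since.

```rust
use std::collections::HashMap;

// Sketch of a scoped hash map: one HashMap plus an undo log. Entering a
// scope pushes a mark; leaving it removes every key inserted since then.
// Simplification: a key re-inserted in an inner scope is removed entirely
// on pop, not restored to its outer value.
struct ScopedMap<K, V> {
    map: HashMap<K, V>,
    log: Vec<K>,       // keys inserted, in order
    marks: Vec<usize>, // log length at each scope entry
}

impl<K: std::hash::Hash + Eq + Clone, V> ScopedMap<K, V> {
    fn new() -> Self {
        Self { map: HashMap::new(), log: Vec::new(), marks: Vec::new() }
    }
    fn push_scope(&mut self) {
        self.marks.push(self.log.len());
    }
    fn pop_scope(&mut self) {
        let mark = self.marks.pop().expect("unbalanced scopes");
        for key in self.log.drain(mark..) {
            self.map.remove(&key);
        }
    }
    fn insert(&mut self, k: K, v: V) {
        self.log.push(k.clone());
        self.map.insert(k, v);
    }
    fn get(&self, k: &K) -> Option<&V> {
        self.map.get(k)
    }
}

fn main() {
    // GVN flavor: key = (op, operand values), value = existing SSA value.
    let mut gvn: ScopedMap<(&str, Vec<u32>), u32> = ScopedMap::new();
    gvn.push_scope(); // dominator block A
    gvn.insert(("iadd", vec![1, 2]), 3);
    gvn.push_scope(); // dominated block B: sees A's entries
    assert_eq!(gvn.get(&("iadd", vec![1, 2])), Some(&3)); // reuse v3
    gvn.pop_scope();
    gvn.pop_scope();
    assert_eq!(gvn.get(&("iadd", vec![1, 2])), None); // gone after A's scope
}
```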

Scoped elaboration

Dominator-tree pre-order traversal, building an e-class-to-elaborated-value map in the same way GVN does, except it’s a scoped hash map.

Anything elaborated by a dominator is available for use in a child.
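The traversal shape, as I understand it, sketched with stubbed-out details and hypothetical names:

```rust
use std::collections::HashMap;

// Sketch of scoped elaboration: pre-order over the dominator tree, one
// scope per block, so anything a dominator elaborated is visible in its
// children. The real work is stubbed out.
struct Elab {
    domtree_children: HashMap<u32, Vec<u32>>, // block -> dominated children
    // plus a scoped e-class -> elaborated SSA value map, as sketched above
}

impl Elab {
    fn elaborate(&mut self, block: u32) {
        self.push_scope();
        self.elaborate_block_body(block); // fills in / reuses map entries
        for child in self.domtree_children.get(&block).cloned().unwrap_or_default() {
            self.elaborate(child);
        }
        self.pop_scope();
    }

    // Stubs standing in for the real work:
    fn push_scope(&mut self) {}
    fn pop_scope(&mut self) {}
    fn elaborate_block_body(&mut self, _block: u32) {}
}

fn main() {
    let mut e = Elab { domtree_children: HashMap::new() };
    e.elaborate(0); // start at the entry block
}
```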

Subsumes:

How would something like:

v3 = v1 + v2
v5 = v1 + v4 + v2

be optimized? It should become v5 = v3 + v4, but that requires reassociating and commuting the operands. The e-graph, if saturated, would have found that form, but how would it know to extract that term instead of an equivalent one? How is the cost computed?
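My understanding of the usual greedy answer to the cost question, as a sketch (not a quote from the talk): each node’s cost is its operator cost plus the best known cost of its operand e-classes, each e-class takes the minimum over its nodes, and extraction picks the argmin node per class.

```rust
// Greedy bottom-up extraction cost (my understanding of the common
// approach): iterate to a fixpoint because classes can reference each
// other in any order.

struct ENode {
    op_cost: u64,
    args: Vec<usize>, // operand e-class indices
}

/// classes[i] = the e-nodes belonging to e-class i; returns best cost per class.
fn class_costs(classes: &[Vec<ENode>]) -> Vec<u64> {
    let mut cost = vec![u64::MAX; classes.len()];
    let mut changed = true;
    while changed {
        changed = false;
        for (i, nodes) in classes.iter().enumerate() {
            for n in nodes {
                // Node cost = op cost + sum of operand class costs, if all known.
                let mut c = Some(n.op_cost);
                for &a in &n.args {
                    c = match (c, cost[a]) {
                        (Some(acc), x) if x != u64::MAX => Some(acc + x),
                        _ => None,
                    };
                }
                if let Some(c) = c {
                    if c < cost[i] {
                        cost[i] = c;
                        changed = true;
                    }
                }
            }
        }
    }
    cost
}

fn main() {
    // class 0: a constant (cost 1); class 1: either an add over class 0 twice
    // (cost 2 + 1 + 1 = 4) or a cheaper equivalent (cost 1 + 1 = 2).
    let classes = vec![
        vec![ENode { op_cost: 1, args: vec![] }],
        vec![
            ENode { op_cost: 2, args: vec![0, 0] },
            ENode { op_cost: 1, args: vec![0] },
        ],
    ];
    assert_eq!(class_costs(&classes), vec![1, 2]); // extraction picks the cheaper form
}
```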

Union nodes

E-graphs perform rewrites in a batch, then repair invariants. The fixup requires backlinks (parent pointers) and re-interning, which are costly. The goal is to eliminate parent lists, duplicated node storage, and parent-list merging and deduping; instead, perform all rewrites eagerly.

Cycles occur in e-graphs even if the input dataflow graph is acyclic (e.g., coming from SSA). For example, x + 0 => x unions the x + 0 node with its own operand x, so the merged e-class contains a node that refers back to that same class: a self-edge.

To solve this, æ-graphs don’t merge e-classes or rewrite nodes in place; instead, they represent e-classes as trees of union nodes. As the e-graph is built, the latest ID for a given value is tracked. Rewrite rules are invoked when a node is created, and only the final unioned ID is entered into the hashcons map.
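A sketch of how I read the union-node idea, with hypothetical names: nothing is merged in place, a union node records that two IDs are equal, rewrite rules fire at node-creation time, and only the final unioned ID is what the hashcons map points at.

```rust
use std::collections::HashMap;

// Sketch of the union-node idea as I understand it (hypothetical names).
type Id = u32;

#[derive(Clone, PartialEq, Eq, Hash, Debug)]
enum Node {
    Const(i64),
    Add(Id, Id),
    /// "These two IDs compute the same value"; unions form a tree per e-class.
    Union(Id, Id),
}

#[derive(Default)]
struct AeGraph {
    nodes: Vec<Node>,
    hashcons: HashMap<Node, Id>,
}

impl AeGraph {
    fn push(&mut self, n: Node) -> Id {
        if let Some(&id) = self.hashcons.get(&n) {
            return id;
        }
        let id = self.nodes.len() as Id;
        self.nodes.push(n.clone());
        self.hashcons.insert(n, id);
        id
    }

    fn add(&mut self, a: Id, b: Id) -> Id {
        let new = self.push(Node::Add(a, b));
        // Eager rewrite at creation time: x + 0 => x.
        if let Some(simpler) = self.simplify_add(a, b) {
            let union = self.nodes.len() as Id;
            self.nodes.push(Node::Union(new, simpler));
            // From now on, the unioned ID stands for this value: later
            // lookups of the same Add in the hashcons find the union.
            self.hashcons.insert(Node::Add(a, b), union);
            return union;
        }
        new
    }

    fn simplify_add(&self, a: Id, b: Id) -> Option<Id> {
        match (&self.nodes[a as usize], &self.nodes[b as usize]) {
            (_, Node::Const(0)) => Some(a),
            (Node::Const(0), _) => Some(b),
            _ => None,
        }
    }
}

fn main() {
    let mut g = AeGraph::default();
    let x = g.push(Node::Const(5));
    let zero = g.push(Node::Const(0));
    let sum = g.add(x, zero);
    // `sum` is Union(Add(x, 0), x): both forms are kept, and no cycle is created.
    assert!(matches!(g.nodes[sum as usize], Node::Union(_, _)));
}
```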

Combination of insights

Æ-graphs combine multiple insights:

E-graphs vs æ-graphs

egg-style e-graph: batched rewriting and repair

Æ-graphs: eager rewriting and immutable union nodes

Would I still need parent pointers for other kinds of optimizations? If so, that would negate some of the benefits of æ-graphs.

Compromises

This is at least as powerful as traditional single-pass rewrites, but it avoids the phase-ordering problem and can make use of multi-version, cost-based extraction.

Influences from e-graphs

Æ-graphs take several things from e-graphs:

Performance

Most slowdowns come from instruction scheduling (#6250) or missing opt rules, such as magic div constants (#6049).

Future

Instruction scheduler as an extraction pass

Optimization through blockparams (phis)

Non-greedy instruction selection

Fused and unrolled rewrites

Instruction scheduling

Work with Cranelift

They mentor students and collaborate with researchers.