notes

Tree-sitter

How Tree-sitter works

Tree-sitter - a new parsing system for programming tools, Max Brunsfeld, 2018.

Tree-sitter parses without backtracking like LR parsing, but uses a generalized form, GLR parsing, which forks the parse stack for every possibility to resolve ambiguities. This reminds me of the Pike VM for regexp matching, in that all possibilities are simultaneously matched without backtracking.

When an error is encountered, it proceeds by inserting error nodes and forking for each way to recover, then selecting the branch that most closely matches a valid parse.

Incremental parsing marks nodes in the old tree, where text was modified, then parses the file again, reusing nodes that were not marked.

Citations

The research papers, that are cited as greatly influencing the design of Tree-sitter:

Practical Algorithms for Incremental Software Development Environments, Tim A. Wagner, 1998
Context Aware Scanning for Parsing Extensible Languages, Eric R. Van Wyk and August C. Schwerdfeger, 2007
Efficient and Flexible Incremental Parsing, Tim A. Wagner and Susan L. Graham, 1998
Incremental Analysis of Real Programming Languages, Tim A. Wagner and Susan L. Graham, 1997
Error Detection and Recovery in LR Parsers, what-when-how, c. 2012
Error Recovery for LR Parsers, Thomas Julian Pennello, 1977

Ideas

Has error recovery
Tree is arena-allocated and referenced by ID
- Easily serializable
Zero-copy
Usable as a concrete or abstract syntax tree
Parses incrementally

Flat AST

AST fork tree. Arena with u32 IDs. Operations: commit, fork, and prune. Each sub-arena in the fork tree starts its IDs just above its parent’s last ID. When a sub-tree is committed, it is merged into the tree under it, which is extended with its elements.

Could fork points converge to start with common nodes?

Since ASTs have span information, each token is unique and no unique table is needed. Thus, tokens do not need to be unified between branches. However, for incremental parsing, a unique table would be useful if not all state can be resumed from the AST.

pub struct AstArena { … }

impl AstArena {
    fn new() -> Self;
    fn edit(&self) -> AstBuilder<'_>;
}

pub struct AstBuilder<'arena> {
    children: Vec<AstBuilder<'arena>>,
}

impl<'arena> AstBuilder<'arena> {
    fn push(&self, node: Node) -> NodeId;
    fn fork(&self) -> AstBuilder<'arena>;
    fn commit(self);
}

Variable fidelity

Concrete (syn fidelity) vs abstract (compiler fidelity) syntax trees.

Spans

Store only lengths of nodes and reconstruct the absolute spans on descent, as opposed to storing them directly, for reusability with incremental updates and space savings. Bracket pair colorization in VS Code stores the line and column lengths of nodes, compressed as a single integer, (instead of the byte length) and balances the tree for logarithmic access.

Research questions

How does Tree-sitter store its ASTs?
What error recovery strategies does Tree-sitter use? Error Detection and Recovery in LR Parsers describes several. Max Brunsfield describes error recovery by forking at a high level at Strange Loop 2018, but mentions none of those strategies.
Since GLR parsing resembles PikeVM, what optimizations from regular expression searching can be applied to parsing?
What about languages that require extra state to parse? Like Ruby disambiguating / between division and regular expressions with name resolution or C++ templates needing expansion during parsing.

Chumsky

Why can’t error-tolerant parsers also be easy to write?, Joshua Barretto, 2022

Chumsky is a parser combinator library in Rust built on PEG, which has error recovery and emits multiple errors as once. It cannot handle off-side rule indentation (and suggests PEG, in general cannot, although Python and Pest can). Ambiguity is not analyzed.

This site is open source. Improve this page.