Andreas Haas, Andreas Rossberg, Derek L. Schuff, Ben L. Titzer (Google); Michael Holman (Microsoft); Dan Gohman, Luke Wagner, Alon Zakai (Mozilla); JF Bastien (Apple)
Proc. of the 38th Conference on Programming Language Design and Implementation (PLDI), Barcelona, Spain, Jun. 2017.
[paper], [PDF], [video], [conference]
To their knowledge, Wasm is the first industrial-strength language or VM designed with a formal semantics from the start.
tee_local
writes to a local while keeping the input on the stack (like
set_local
preceded by dup
)load
/store
access by 32-bit address; may be unaligned, albeit
slowerblock
, loop
, and if
/else
terminated with end
and inner
instructions are a blockif
pops operand; else
may be omittedloop
does not automatically iterate its block and needs explicit
branchesblock
or if
,
and backward, if loop
, to a nesting depthbr_if
conditional jumpbr_table
selects target from label/index list, with last label for
out-of-bounds index caseFor example, a switch
with fallthrough:
switch (x) {
case 0: ...A...
case 1: ...B... break;
default: ...C...
}
is translated to:
block block block block
br_table 0 1 2
end ...A...
end ...B... br 1
end ...C...
end
return
is a shorthand for thiscall
pops function args and pushes result upon returncall_indirect
executes the dynamically-indexed function in a table
of functions, whose signatures needn’t necessarily match
grow_memory
returns -1 on engine out-of-memory failurecall
and call_indirect
may have stack overflow, but that’s not
semantically-observable from within WasmThese restrictions have since been lifted (see “Multi-Value All the Wasm!”):
(i32.const 1) (i32.const 2) i32.add
, the type of the
middle instruction is weakened from ε -> i32
to i32 -> i32 i32
to
compose with the other two.unreachable
, br
, br_table
, return
), so expressions
such as a + b
can be simply compositionally compiled as
compile(a) compile(b) i32.add
, without worrying about whether a
or
b
contain branches or traps.Wasm is an abstraction over hardware, not over a programming language, so it doesn’t provide a built-in object model. Producers map their data types to numbers or the memory and can define an ABI so that modules can interoperate in heterogeneous applications.
V8 (Chrome), SpiderMonkey (Firefox), and JavaScriptCode (WebKit) compile modules AOT before instantiation. Chakra (Edge) interprets modules on first execution and later JIT compiles the hottest functions.
The SpiderMonkey fast baseline JIT emits machine code and validates in a single pass with greedy register allocation. The IonMonkey optimizing JIT compiles it in parallel. V8 and SpiderMonkey achieve 5-6x compilation speed improvement with 8 compilation threads.
Wasm can be decoded to SSA form in a single linear pass. This is done by V8 and SpiderMonkey. V8’s TurboFan compiler uses a sea of nodes graph-based IR (like Graal!).
Reference interpreter is written in OCaml.
The memory size changes so infrequently that the JIT can constant fold it and embed it into the machine code, then patch it later if it changes. Deoptimization of the code is not necessary for that.
JavaScript API can store compiled Wasm modules as blob in IndexedDB to
later query for cached module. This seems like it should be a built-in,
automatic capability in the browser, including expanding this to include
the compiled machine code for JavaScript files, since I think that most
sites would benefit from this caching. I think this could only be
reliably-implemented for resources that include a cache key, such as the
ETag
HTTP header.
Other memory safety models:
Wasm, on the other hand, enforces memory safety with simple bounds checks or virtual memory techniques.
Wasm bytecode speed and simplicity of validation informed by JVM and CIL stack machine experience. The description of Wasm verification takes one page of spec, but 150 pages for the JVM.
JVM has a potentially O(n^3) worst-case for the iterative dataflow
approach that is a consequence of unrestricted goto
s and other
idiosyncrasies that had to be fixed by adding stack maps.
JVM, CIL, and Android Dalvik allow irreducible loops and unbalanced locking structures in bytecode and just relegate those constructs to an interpreter.